Abstract

We develop and present results of an artificial neural network (ANN) based compensation
technique for mismatched classifier training and testing conditions in a speaker
identification (SID) task. One ANN per feature per speaker is trained to perform a mapping
of that feature from a corrupted condition to an undistorted condition. Therefore,
a classifier trained under one condition may be used to classify data collected under a
different condition.

Speech  utterances  from  168  speakers,  collected  in  a  studio,  and  also  re-recorded  after 
transmission  over  telephone  networks,  are  used  for  developing  and  testing  the  method. 
Peak  formant  resonant  frequencies,  their  bandwidths,  and  pitch  are  used  as  features.  These 
features  from  the  studio  speech  are  used  to  train  Gaussian  Mixture  Model  classifiers. 
Portions  of  the  studio  and  telephone  speech  are  used  to  train  the  compensation  ANNs.  In 
mismatched  train  and  test  conditions,  features  from  telephone  speech  are  modified  by  the 
trained  ANNs  and  applied  to  the  GMMs  trained  with  features  from  studio  speech. 

Without compensation, SID accuracy is 6%. The compensation method developed
in this work provides mismatch SID accuracy of 58.3%. Previous research on the same
data with the commonly used Mel-Frequency Cepstral Coefficients as features and a typical
compensation method of Cepstral Mean Subtraction with Band-Limiting gives SID
accuracy of 27.4% with the same type of classifiers.
CHANNEL-MISMATCH COMPENSATION
IN  SPEAKER  IDENTIFICATION: 

FEATURE  SELECTION  AND  ADAPTATION 
WITH  ARTIFICIAL  NEURAL  NETWORKS 

I.  Introduction 

1.1  Overview 

Speaker recognition is the process of identifying a person from the characteristics
of their voice. As performed on computers, speaker identification (SID) is the task of
choosing, from a set of speaker models, the one that best matches a given utterance. The
models are formed from past information and are stored in computer memory for eventual
SID testing comparisons. A difficulty arises when the environmental conditions under
which the models were formed differ from the conditions associated with the new
utterance to be tested. Published attempts to solve this channel-mismatch problem have
yielded relatively poor results [13], [20], [22], while SID under common training and
testing conditions is generally considered solved [3], [7], [13], [20].

Although ideal theoretically, training speaker reference models under the same degradations
as the test features is often not realistic, and the training of models under noisy
conditions often still leads to suboptimal results, whether or not the testing is done
under better conditions [3], [6]. Therefore, the goal is to create an identification system
robust to mismatched training and testing conditions. In this study, the mismatch involves
the difference between studio-quality speech and the same speech distorted over telephone
lines and equipment. This simulates conditions for remote access by deployed military
personnel to secure electronic equipment, as one example.
1.2 Problem Statement


Investigate  features  which  are  robust  to  channel  mismatch  and  implement  a  process 
to  address  the  effects  of  mismatched  training  and  testing  conditions  on  those  features  in 
SID. 

1.3  Scope 

Features from the 168-speaker test set of the Texas Instruments and Massachusetts
Institute of Technology TIMIT database and NYNEX's NTIMIT database were used to form
speaker models and were manipulated through compensation methods for SID testing. The
general objectives for investigating the previously stated problem follow:

• Based on previous research, choose features proven to have reasonable speaker-specific
characteristics to achieve high SID accuracy.

•  Train  models  using  part  of  TIMIT,  a  studio-quality  database. 

•  Perform  SID  testing  using  the  speaker  models  against  only  studio-quality  speech  to 
confirm  choice  of  features  and  maximum  achievable  results. 

•  Perform  SID  testing  using  the  speaker  models  against  some  of  the  distorted  speech 
in  NTIMIT  utterances  to  achieve  baseline  statistics  on  SID  under  uncompensated, 
mismatched  conditions. 

•  Develop  channel  mismatch  compensation  technique  using  portions  of  TIMIT  and 
NTIMIT  databases  not  used  in  SID  testing. 

•  Test  compensation  technique  through  SID  testing  with  the  trained  models. 

1.4  Approach 

Based on Sambur's work [8] and preliminary experiments, formant resonance frequencies,
their corresponding bandwidths, and pitch were chosen as reasonable features
to pursue. Baseline statistics were obtained for SID for same-channel and cross-channel
conditions. That is, speaker models were created by using TIMIT training sentences, and
testing was done on TIMIT and NTIMIT utterances. Artificial neural networks (ANNs)
were then used to compensate for the effects of the channel in NTIMIT before obtaining
SID rates for comparison to the baseline. The ANNs were used in feature mapping or
function approximation, attempting to undo the effects of the telephone channel.

1.5 Thesis Organization

The  remainder  of  this  thesis  is  organized  as  follows.  Chapter  II  provides  information 
on  the  theory  of  features,  the  semi-parametric  classifiers,  and  ANN  functional  mappers 
we  used.  Chapter  III  explains  the  methodology  we  implemented  in  the  techniques  and 
experiments.  Chapter  IV  contains  our  results,  and  Chapter  V  contains  our  conclusions 
and  recommendations. 
II. Background Theory


2.1  Introduction 

This  chapter  provides  the  necessary  theoretical  foundation  for  SID  and  ANNs.  The 
biological  and  mathematical  basis  for  the  selected  features  is  followed  by  a  discussion  on 
ANN  function  approximation  and  semiparametric  classification  theory. 

2.2  Process  Overview 

Figure 2.1 shows our general SID process developed from the theories that will be
discussed in this chapter. Feature selection and extraction were used for both speaker
model generation and the creation of mapping ANNs. Original, undistorted features were
SID tested against the speaker models for confirmation of feature selection and to establish
maximum expectations for compensation results. Uncompensated features were SID tested
for baseline statistics against which to compare the results obtained with features mapped
by the ANNs.


Figure  2.1  General  Diagram  of  Channel  Mismatch  Solution 


2.3  Feature  Selection 

The most commonly used and accepted features for speaker recognition are Mel-Frequency
Cepstral Coefficients (MFCCs) [3], [5], [6], [10], [12], [13]. To obtain MFCCs,
a discrete Fourier transform (DFT) is performed on the sampled speech segment, and the
resulting spectrum is processed through a series of triangular filters spaced along a
Mel-frequency scale; the Mel-frequency scale is a nonlinear frequency scale representing
characteristics of human hearing. The log of the output magnitudes of the filters is then
calculated and processed by a transform, usually a discrete cosine transform, with the
corresponding coefficients being the MFCCs for the sampled speech segment.
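The MFCC steps described above can be sketched in a few lines of Python. This is only an illustration, not the front end used in the cited work: the Mel formula (the common 2595 log10 form), the Hamming window, the 24 filters, the 12 retained coefficients, and all function names are assumptions chosen for the example.

```python
import cmath
import math


def hz_to_mel(f_hz):
    # A commonly used Mel-scale formula (assumed here, not taken from the thesis).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def magnitude_spectrum(frame):
    # Hamming-window the frame, then take a direct DFT up to the Nyquist bin.
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    return [abs(sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n)
                    for i in range(n)))
            for k in range(n // 2 + 1)]


def mel_filter_energies(spectrum, sample_rate, n_filters):
    # Triangular filters whose edges are equally spaced on the Mel scale.
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    energies = []
    for f in range(n_filters):
        lo, mid, hi = edges[f], edges[f + 1], edges[f + 2]
        total = 0.0
        for k, mag in enumerate(spectrum):
            freq = k * nyquist / (len(spectrum) - 1)
            weight = min((freq - lo) / (mid - lo), (hi - freq) / (hi - mid))
            if weight > 0.0:
                total += weight * mag
        energies.append(total)
    return energies


def mfcc_frame(frame, sample_rate, n_filters=24, n_coeffs=12):
    energies = mel_filter_energies(magnitude_spectrum(frame), sample_rate, n_filters)
    log_e = [math.log(e + 1e-10) for e in energies]  # small offset avoids log(0)
    # Discrete cosine transform (type II) of the log filter outputs.
    return [sum(log_e[m] * math.cos(math.pi * k * (m + 0.5) / n_filters)
                for m in range(n_filters))
            for k in range(1, n_coeffs + 1)]
```

For a 20 ms frame sampled at 16 kHz, `mfcc_frame(frame, 16000)` returns twelve coefficients for that frame.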

SID accuracy plummets when MFCCs are used in conditions with channel mismatch
[13], [20]. The results are not promising, even with compensation [13], [20], [22]. Therefore,
we sought a different set of features. Sambur [8] demonstrated that the formants and pitch
are good features for similar training and testing conditions, although they have not been
demonstrated to be so to the extent that MFCCs have.


Formants  are  the  resonant  frequencies  the  vocal  tract  imposes  onto  the  signal  coming 
from  the  diaphragm.  For  voiced  phonemes,  speech  production  generally  can  be  modeled  by 
a  quasi-periodic  pulse-train  generator  with  spectral  modulation  occurring  through  a  cavity, 
the  vocal  tract.  The  excitation  from  the  diaphragm  is  changed  by  the  glottis,  a  cartilage 
plate,  by  manipulating  and  stretching  the  adjacent  vocal  cords  as  air  is  passed  through. 
If  the  vocal  cords  are  vibrating,  phonation  occurs  and  the  segment  of  speech  is  declared 
voiced. If the waveform is instead aperiodic or random due to the vocal cords not oscillating,
it is unvoiced. Modulation is imposed on the glottal waveform by the vocal tract.
The vocal tract adds its inherent natural resonances according to the current shape of the
tract. These resonant frequencies are formants, and, in this thesis, their peak frequencies
will be referred to as formants. The particular formant spectrum structures resulting from
the unique shapes of the vocal tract characterize all vowels and some consonants in the
English language [5]. The fundamental frequency, the reciprocal of the fundamental period,
is called the pitch. An example is given in Figure 2.2, showing a sampled speech segment
and its spectrum with the first four formants from the vowel /a/ as in "at".


Figure 2.2 Time Plot (time in msec) and Corresponding Formant Spectrum (frequency in kHz) from Vowel "a" in "at" [4]
Sambur [8] performed feature saliency tests on 92 features including formants, bandwidths,
pitch, formant contours, glottal-source poles, and nasal pole locations. These
general parameters were further subdivided within the 92 according to the particular words
they came from; for example, one of the best features he cited was the third formant in the
vowel u. In his final list of best features, most of the top twenty were the first four formants
and pitch from different phonemes; none of these five was particularly dominant over the
other four in this list. Besides Sambur, Parsons [5] also recommends using at least three
formants for speaker recognition, which includes SID. Since bandwidths are associated with
these formants, we felt they should be included along with formants and pitch as potential
features for addressing the channel mismatch problem.

2.4 ANNs for Approximating Functions

Although multilayer perceptron (MLP) artificial neural networks (ANNs) are often
used for classification of data, they can also be used to approximate functions. Refer
to the appendix for background ANN theory. More specifically, MLP-ANNs with two
layers of weights and nonlinear activation functions can approximate arbitrarily well any
continuous functional mapping from one finite-dimensional space to another. This holds as
long as the number of hidden layer units is sufficient and the number of target nodes does
not exceed the number of input nodes [1]. There is a wealth of published papers on this
subject [23], [24], [25], [26], [27], [28], [29], [30], [31].

Instead of training the ANNs to connect certain inputs to particular classes, in our
application we train ANNs to function like inverse channel filters by having the output
training targets be the values we desire the corresponding training inputs to become. Then
we input the corrupted test features to the ANNs with the purpose of "cleaning" those
features.

In very general terms, adding layers can decrease the typical ANN error criterion.
More specifically, with nonlinear function approximations, multiple layers
with nonlinear activation functions lead to a much higher probability of lowering the
standard error parameter, Sum-Squared-Error (SSE), to an acceptable level [1]. The use
of backpropagation, whereby the error derivatives of the network outputs are propagated
back to the hidden layers to be used in their error metric, is a technique for adjusting
weights to minimize SSE. While updating the weights based on the errors relating to the
weights can lead to convergence, the use of an adaptive learning rate and momentum
greatly assists in finding a global minimum. The adaptive learning rate is a fractional
multiplicative term for adjusting the change in weight updates according to the level of error
change. For example, it is desirable to make larger changes in weight values when the SSE
is decreasing rapidly. Momentum assists in centering on a global minimum by causing the
weight changes to be based on previous weight changes [19].
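The interaction of momentum and an adaptive learning rate can be illustrated with a short Python sketch. It is a generic gradient-descent loop written for this explanation, not the toolbox routine used in this work; the parameter names (`lr_inc`, `lr_dec`, `max_ratio`) and their default values are assumptions.

```python
def train_with_momentum(sse_fn, grad_fn, weights, lr=0.01, momentum=0.9,
                        lr_inc=1.05, lr_dec=0.5, max_ratio=1.04, epochs=100):
    """Minimize sse_fn by gradient descent with momentum and an adaptive rate.

    sse_fn(weights) returns the sum-squared error; grad_fn(weights) returns
    its gradient as a list. All default values are illustrative assumptions.
    """
    weights = list(weights)
    step = [0.0] * len(weights)
    sse = sse_fn(weights)
    history = [sse]
    for _ in range(epochs):
        grad = grad_fn(weights)
        # Momentum: the new weight change depends on the previous change.
        trial_step = [momentum * s - lr * g for s, g in zip(step, grad)]
        trial = [w + d for w, d in zip(weights, trial_step)]
        trial_sse = sse_fn(trial)
        if trial_sse > max_ratio * sse:
            # Error grew too much: reject the step, shrink the learning
            # rate, and discard the accumulated momentum.
            lr *= lr_dec
            step = [0.0] * len(weights)
        else:
            if trial_sse < sse:
                lr *= lr_inc  # error is falling, so take larger steps
            weights, step, sse = trial, trial_step, trial_sse
        history.append(sse)
    return weights, history
```

Setting `max_ratio=1.0` rejects every error increase, which makes the recorded SSE history non-increasing; larger values tolerate small temporary increases.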

2.5  Mixture  Classification  Theory 

Statistical classifiers require estimates of class conditional probability density functions
(PDFs). Semiparametric methods of classification often prove to be ideal since they
can combine good aspects of both nonparametric and parametric approaches. They avoid
the problem of the model growing directly with the size of the data set; the model only
becomes more sophisticated with data expansion. One type of semiparametric method is
the Gaussian Mixture Model (GMM). GMMs, given the necessary number of components
with corresponding appropriate parameters, can approximate any non-disjoint density to
a desired accuracy [1]. GMMs have been applied with great success to SID tasks,
approximating even multimodal PDFs very well [7].

The  GMM  is  simply  a  linear,  weighted  combination  of  M  basis  functions,  here  normal 
probability  density  functions: 

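$$p(x) = \sum_{j=1}^{M} p(x \mid j)\, P(j) \tag{2.1}$$

where $p(x \mid j)$ is a normal probability density function,

$$p(x \mid j) = \frac{1}{\sqrt{2\pi\sigma_j^{2}}} \exp\left\{ -\frac{(x - \mu_j)^{2}}{2\sigma_j^{2}} \right\} \tag{2.2}$$

and $P(j)$ is the mixture weight, $\mu_j$ is the mean, and $\sigma_j$ is the standard deviation for
mixture component $j$.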
After the GMM is formed by inputting training data, an error metric, the negative
log-likelihood, is used for a set of $N$ test observations [1]:

$$E = -\ln \mathcal{L} = -\sum_{n=1}^{N} \ln p(x_n) = -\sum_{n=1}^{N} \ln \left\{ \sum_{j=1}^{M} p(x_n \mid j)\, P(j) \right\} \tag{2.3}$$

The error is minimized by maximizing the likelihood score, which, when all speakers are
equally likely a priori, corresponds to the Maximum A Posteriori Probability (MAP)
decision. Therefore, in SID the GMM speaker model with the highest likelihood
score given the utterance would be considered the identified speaker. See the Appendix
for more information on classification theory.
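As an illustration of how the mixture density of Equation 2.1 and the error of Equation 2.3 lead to an identification decision, the following Python sketch scores scalar observations against univariate mixtures. The function names, the `(weight, mean, std)` tuple layout, and the speaker labels in the usage note are hypothetical; the feature vectors in this work are multidimensional, so this is a simplification.

```python
import math


def log_gaussian(x, mean, std):
    # Natural log of a univariate normal density.
    return (-0.5 * math.log(2.0 * math.pi * std * std)
            - (x - mean) ** 2 / (2.0 * std * std))


def gmm_log_likelihood(observations, components):
    """Return ln L = sum_n ln p(x_n) for a mixture.

    components is a list of (weight P(j), mean mu_j, std sigma_j) tuples.
    """
    total = 0.0
    for x in observations:
        logs = [math.log(w) + log_gaussian(x, m, s) for w, m, s in components]
        peak = max(logs)  # log-sum-exp keeps the sum numerically stable
        total += peak + math.log(sum(math.exp(v - peak) for v in logs))
    return total


def identify_speaker(observations, speaker_models):
    # The identified speaker is the model with the highest log-likelihood,
    # i.e. the lowest error E = -ln L.
    scores = {name: gmm_log_likelihood(observations, comps)
              for name, comps in speaker_models.items()}
    return max(scores, key=scores.get), scores
```

For example, with two hypothetical models `{"spk_a": [(0.5, 500.0, 50.0), (0.5, 1500.0, 100.0)], "spk_b": [(1.0, 3000.0, 100.0)]}`, observations near 500 and 1500 are assigned to `spk_a`.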

2.6 Summary

Based on poor performance of SID under mismatched conditions with MFCCs [13],
[20], [22] and work by Sambur [8], formants, bandwidths, and pitch were chosen as features.
As good features for modeling the vocal tract, these features must be taken from speech
segments declared voiced, since features from voiced speech model the entire vocal tract
better than those from unvoiced speech. ANNs were chosen as a transforming compensation
technique for the effects of mismatch on the features. The mapped features were used in
final testing with Gaussian Mixture Speaker Models, which have been shown [7] to have
good performance characteristics in SID applications.
III. Approach and Methods


3.1  Introduction 

This chapter describes the databases and expands on feature, ANN, and GMM theories
by demonstrating their use for within-channel and cross-channel SID applications.

3.2  TIMIT  and  NTIMIT  Databases 

We used the 168-speaker test set of the Texas Instruments and Massachusetts Institute
of Technology TIMIT database and NYNEX's NTIMIT database. NTIMIT is TIMIT rerecorded
through a carbon-button telephone handset and artificial mouth, sent over
telephone lines of various lengths, and looped back for recording at a 16 kHz sampling rate [15].
Although originally designed for speech recognition research, TIMIT is a good database for
SID under an almost ideal environment given its specifications: 8 kHz bandwidth, minimum
equipment noise and variability, and depth in phonetic diversity in utterances approximately
three seconds long.

The 168 speakers are divided into eight dialect regional subsets, varying in size from 11
to 32 speakers. Ten sentences were recorded from each of the 168 speakers. For diversity,
the ten sentences per speaker are divided into two sentences with sa designations, three
with si, and five with sx. The two sa are identical for all speakers, while the si and sx
sentences are not. The last two sx sentences, according to the ordering given by the
UNIX ls command, were used for testing, and the other eight sentences were used
in training. The training set included the two sa to ensure some completely common
conditions for all speakers for GMM generation [7]. The two sx were used to simulate more
of a real-world text-independent condition.
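The train/test split described above can be expressed as a short Python sketch. The sentence identifiers in the usage note are hypothetical examples, and plain string sorting is assumed to match the ordering that ls produced.

```python
def split_sentences(sentence_ids):
    """Return (train, test): the last two sx sentences in sorted order are
    held out for testing; the remaining eight are used for training."""
    ordered = sorted(sentence_ids)  # assumed to mimic the ls ordering
    sx = [s for s in ordered if s.startswith("sx")]
    test = sx[-2:]
    train = [s for s in ordered if s not in test]
    return train, test
```

Given the identifiers `sa1, sa2, si648, si1027, si1657, sx37, sx127, sx217, sx307, sx397`, string ordering places `sx37` after `sx307`, so `sx37` and `sx397` become the test sentences.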

3.3  Preprocessing  of  Features 

3.3.1 ESPS Feature Extraction Commands. Figure 3.1 helps clarify the following
discussion. The pitch values in each frame and the corresponding probabilities of
voiced speech were obtained with the ESPS 5.1 get_f0 command through a C-shell script.
The command get_f0 is similar to an algorithm developed by Secrest and Doddington [14].
Using a correlation function on filtered (one-pole model) linear prediction residuals, potential
pitch values are obtained and then used along with spectral consistency penalties over
many frames and voicing state information to make the final determination of pitch terms [16].

The formants and bandwidths are obtained via the ESPS 5.1 formant command in a
C-shell script. They are selected from potential values given by the solution for the linear
prediction polynomials' roots. Cost penalties are imposed on these candidates, and the
final terms are a result of a modified Viterbi algorithm for achieving an optimum mapping
regarding consistency over multiple frames of the speech waveform. A preemphasis value of
0.7 was used [4], [6], [16]. With a desire to achieve the multimodal sharpness of the spectra
without distortion, we chose a linear prediction coefficient (LPC) order of twelve [6], [16].
For smoothing of the frame transitions, a Hamming window was applied to each 20
millisecond frame with a ten millisecond step size. The preemphasis value, the LPC order,
and the step size were all software defaults. The other parameters were common choices [6].
The choice of these values was not of central importance to this thesis; what mattered was
the SID performance when these features are undistorted, corrupted, and transformed.


Figure  3.1  Diagram  of  Speaker  Modeling,  Baseline  SID,  and  SID  after  Compensation 

3.3.2 Matlab Feature Preprocessing. After the generation of outputs from the
formant and get_f0 commands, the formants, bandwidths, pitch, and probabilities of voiced
speech were input to a data manipulation program written in Mathworks Matlab 4.2c. As
evident from the file headers, the outputs of the two commands have different starting times,
which caused practically a one-frame shift between the two. Also, get_f0 might have one and
occasionally two extra frames, disregarding the shift mentioned. Therefore, frame-by-frame
correspondence had to be achieved, primarily because the ANNs are to be direct feature
mappers. Furthermore, accurate values of pitch (being a measurement of periodicity) and
formants for proper modeling of the vocal tract can only come from voiced segments. Thus,
we needed to eliminate unvoiced frames. We did this by employing the probabilities of voiced
speech to discard the unvoiced data. The use of 0.5 as a decision threshold for voicing
probabilities caused no complications, as the probabilities from get_f0 were typically zero or
one, with occasionally a one-millionth term or a 0.99 value.

Then,  since  first  formant  ranges  for  vowels,  voiced  fricatives  and  voiced  stops  are 
typically  below  1000  Hz  [4],  [6]  and  pitch  values  in  normally  read  sentences  should  not 
generally  exceed  160  Hz  for  males  and  400  Hz  for  females  [5],  criteria  for  unlikely  values 
can  be  developed.  Extreme  outliers,  such  as  first  formants  at  2000  Hz,  triggered  a  part  of 
the  preprocessing  algorithm  to  eliminate  the  entire  feature  vector,  since  other  values  might 
also  be  impacted.  We  chose  a  threshold  of  plus  or  minus  two  standard  deviations,  as  this 
was  found  to  typically  eliminate  about  three  percent  of  the  original  feature  vectors  of  each 
speaker’s  feature  matrices.  The  use  of  probabilities  of  voiced  speech  eliminated  about  50% 
of  the  typically  250  original  feature  vectors  per  utterance. 
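The two frame-selection steps above, voicing and outlier rejection, can be sketched as follows. This is not the Matlab program used in this work: the function name, the choice to test a single column per call, and the reading of the two-standard-deviation threshold as a deviation from that column's mean are all assumptions made for illustration.

```python
import math


def select_frames(vectors, prob_voiced, checked_column=0,
                  voicing_threshold=0.5, n_std=2.0):
    """Keep voiced feature vectors whose checked feature is not an outlier.

    vectors is a list of per-frame feature lists; prob_voiced holds the
    matching voicing probabilities. A whole vector is dropped if its value
    in checked_column lies more than n_std standard deviations from the
    mean of that column over the voiced frames (assumed interpretation).
    """
    voiced = [v for v, p in zip(vectors, prob_voiced) if p > voicing_threshold]
    if not voiced:
        return []
    values = [v[checked_column] for v in voiced]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))
    return [v for v in voiced if abs(v[checked_column] - mean) <= n_std * std]
```

Calling the function once per feature column, and feeding each result into the next call, would remove a vector as soon as any one of its features is flagged.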

3.4 HTK Gaussian Mixture Models

GMMs can be implemented with HTK 2.0, since GMMs are single-state Hidden
Markov Models (HMMs). GMMs were created by using feature vectors from the eight
training sentences from each speaker in a C-shell script which included the HTK commands
HInit, HRest, and HHEd. HInit provides initial estimates for the means and variances
of the component densities in a GMM. HInit functions by repeatedly segmenting training
data by Viterbi alignment and recalculating the means and variances using a K-Means
clustering algorithm. HRest uses an EM/Baum-Welch algorithm for re-estimation of
the GMM parameters to best model the feature vectors' probability density for individual
speakers; this theory is further discussed in the Appendix. We set the variance floor at
0.01 for comparison to previous research [7], [13]. After initially modeling with one mixture
component, the number of non-defunct mixture components was grown by an option of the
mixture editor, HHEd. The mixture with the largest weight is duplicated. Each twin mixture
then has its weight halved and is offset from the original mean by plus or minus two standard
deviations [13], [17].
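The mixture-growing step can be illustrated with a small Python sketch for univariate components. It is not HTK code; the function name and the `(weight, mean, std)` tuple layout are hypothetical, and the size of the mean offset is left as an explicit argument rather than fixed.

```python
def split_heaviest(components, offset_in_std):
    """Split the component with the largest weight into two twins.

    components is a list of (weight, mean, std) tuples. Each twin receives
    half of the original weight and a mean shifted by plus or minus
    offset_in_std standard deviations; the other components are unchanged.
    """
    j = max(range(len(components)), key=lambda i: components[i][0])
    weight, mean, std = components[j]
    twins = [(weight / 2.0, mean - offset_in_std * std, std),
             (weight / 2.0, mean + offset_in_std * std, std)]
    return components[:j] + twins + components[j + 1:]
```

Because the two twins share the original weight, the mixture weights still sum to one after the split; re-estimation then lets the twins move apart to fit the data.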

For  SID,  a  C-Shell  script  is  implemented  which  uses  the  HTK  2.0  HVite  command. 
This  command  uses  Viterbi  algorithmic  techniques  to  test  trained  models  with  a  set  of 
feature  vectors  from  a  speaker.  The  object  is  to  find  the  model  with  the  greatest  log- 
likelihood  score  given  a  set  of  feature  vectors  from  a  particular  utterance. 

3.5  Feature  Mapping  Artificial  Neural  Networks 

The main problem to address is the nonlinear nature of the handset, while accounting
for the bandwidth limitations of the telephone networks [3]. This problem was evident
from equipment specification, spectrum study, previous research [3], and some attempts at
solving the channel mismatch problem based on linear models such as Missing Features [11].
As previously discussed, ANNs became a logical choice to address this nonlinear problem.
Since the speaker models were trained on TIMIT, it would be ideal if the channel-distorted
test utterances from NTIMIT actually could be made to appear as if no distortion had
occurred. We used Mathworks Matlab Neural Networks Toolbox (MM-NNT) 2.0b to
construct a training method for function-approximating ANNs. The method inputs all
the particular speaker's training feature matrices from NTIMIT and sets a corresponding
target output of the TIMIT feature matrices. This is supervised training [1], [9].

The first experiments were done by inputting the parts of the feature matrices with
only formants, and setting the target output as only one feature. Therefore, one ANN per
feature per speaker had to be created; see Figure 3.2. We used this architecture and set of
features due to Sambur's list of best features [8]. An algorithm for ensuring convergence
and a reasonable SSE was included in the ANN-generation program. Training the
ANNs often took longer than originally anticipated, since convergence was not guaranteed
from the initial weights of each training attempt. Reaching an acceptable SSE often
occurred only after several loop iterations. Each loop iteration caused a reinitialization of
the ANN weights given by the MM-NNT initff.m routine at the beginning. This last step was
needed, since experiments were completed on a smaller scale to obtain the correct number
of hidden nodes and ANN parameters.

[Figure 3.2 diagram: The Training of the Feature Mapping Artificial Neural Networks.
Input ("known quantity"): feature matrices from the first eight NTIMIT TRN utterances per speaker.
Target ("known quantity"): elements of single features from the first eight TIMIT TRN utterances per speaker.
The number of features used determines the number of artificial neural networks per speaker.]

Figure 3.2 The Configuration for Training the Feature Mapping ANNs

After the complete training of each speaker's ANNs for each feature, the feature vectors
from individual test utterances were input to each of the ANNs as shown in Figure 3.3.
The transformed feature outputs from each ANN were then recombined into a transformed
NTIMIT feature matrix. As shown in Chapter IV of this thesis, some features appeared,
from a study of samples, not to transform well to the corresponding TIMIT feature matrix.
Therefore, other feature combinations were also tried in SID, such as using one or more
original NTIMIT features along with other transformed features.

To elaborate further on the neural network architecture, a number of attempts were
made to find the parameters which were optimized in the sense of SSE. When considering
four formants as inputs and one as output, the trials with subsets of speakers were
performed with hidden nodes ranging from three to 25. ANNs with twelve hidden nodes
achieved optimum results, though up to 20 was close. We implemented the idea, partially
based on previous research, that the number of hidden layer nodes should increase with
input layer complexity and could be tied to the ratio of hidden nodes to input nodes. In this
case, the best results were obtained with ratios between 3:1 and 5:1. This empirical rule was
then checked for validity with other feature combinations. With nine inputs of formants,
bandwidths and pitch, 45 hidden nodes, i.e. a 5:1 ratio, gave the best results. Decreasing
the number of nodes towards a 3:1 ratio caused a rapid degradation in accuracy. Increasing
to 72 hidden nodes yielded good, but lower percentages. We also noted the ANNs with 72
nodes took two to three times as long to train as those with 45 hidden nodes.

[Figure 3.3 diagram: The Usage of the Feature Mapping Artificial Neural Networks.
Input ("known quantity"): feature matrices from the last two NTIMIT TST utterances per speaker.
Output ("unknown quantity"): elements of single features from the mapped NTIMIT utterances per speaker.
The outputs are then grouped into matrices to be used in HVite SID.]

Figure 3.3 The Configuration for Transforming the Features through the ANNs

The  ANN  parameter  search  was  appropriate  for  the  full  speaker  set  though  done  on 
a  subset,  since  the  ANNs  were  trained  on  individual  speakers.  We  settled  on  a  momentum 
rate  of  0.9  with  a  learning  rate  decrease  (LRD)  of  0.5,  versus  the  cited  rates  of  0.95  [19] 
and  default  of  0.7,  after  a  number  of  subset  trials.  The  LRD  is  a  multiplying  fraction 
which  reduces  the  momentum  term  when  an  error  increase  is  encountered. 


The initial parameter search trials also demonstrated that the optimum pair of nonlinear
activation functions was the logistic sigmoid for the output layer and the hyperbolic
tangent for the hidden layer. Before processing, the feature elements
were all divided by 10000 to keep them within the range of the activation functions. After
being transformed, they were multiplied by that same constant to achieve useful values for
classification.
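The shape of one feature-mapping network, with a hyperbolic tangent hidden layer, a logistic sigmoid output, and the division and multiplication by 10000, can be sketched in plain Python. This is an illustration only: the toolbox routines are not reproduced, the Gaussian weight initialization stands in for initff.m, and every function name is hypothetical.

```python
import math
import random

SCALE = 10000.0  # features are divided by this before the network


def init_network(n_inputs, n_hidden, seed=0):
    # Small random weights; an assumed stand-in for the toolbox initializer.
    rng = random.Random(seed)
    return {
        "W1": [[rng.gauss(0.0, 0.5) for _ in range(n_inputs)]
               for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "W2": [rng.gauss(0.0, 0.5) for _ in range(n_hidden)],
        "b2": 0.0,
    }


def forward(net, x):
    # Hidden layer: hyperbolic tangent. Output layer: logistic sigmoid.
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(net["W1"], net["b1"])]
    z = sum(w * h for w, h in zip(net["W2"], hidden)) + net["b2"]
    return hidden, 1.0 / (1.0 + math.exp(-z))


def sse_and_gradients(net, inputs, targets):
    """Return the SSE over all (scaled) frames and its backpropagated gradients."""
    n_hidden, n_inputs = len(net["W1"]), len(net["W1"][0])
    grads = {"W1": [[0.0] * n_inputs for _ in range(n_hidden)],
             "b1": [0.0] * n_hidden, "W2": [0.0] * n_hidden, "b2": 0.0}
    sse = 0.0
    for x, t in zip(inputs, targets):
        hidden, y = forward(net, x)
        sse += (y - t) ** 2
        delta_out = 2.0 * (y - t) * y * (1.0 - y)   # through the sigmoid
        grads["b2"] += delta_out
        for j, h in enumerate(hidden):
            grads["W2"][j] += delta_out * h
            delta_hid = delta_out * net["W2"][j] * (1.0 - h * h)  # through tanh
            grads["b1"][j] += delta_hid
            for i, xi in enumerate(x):
                grads["W1"][j][i] += delta_hid * xi
    return sse, grads


def map_feature(net, telephone_vector_hz):
    # Scale the telephone features down, map them, and scale the result back up.
    _, y = forward(net, [v / SCALE for v in telephone_vector_hz])
    return y * SCALE
```

A network with four formant inputs and twelve hidden nodes would be created with `init_network(4, 12)`; training would repeatedly move the weights against the gradients returned by `sse_and_gradients`, with inputs and targets both divided by `SCALE`.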

3.6  Summary 

The 168-speaker test set of TIMIT and NTIMIT was used in this study. Although
originally intended for use in speech recognition research, the databases' characteristics,
including richness in phonemes and dialects, make them suitable for use in SID. The extraction
of features from these sampled speech databases was done primarily through the use
of the ESPS 5.1 tools formant and get_f0. They both use linear prediction techniques for
estimation, then assign cost penalties related to consistency to assist in making final value
determinations.

Recognizing GMMs are single-state HMMs, HTK 2.0 provided the means for generating
GMMs for each speaker in the databases. A C-shell script with the initializing
(HInit), growing (HHEd), and parameter re-estimating (HRest) of the GMMs was used.

To address the apparent nonlinear nature of the mismatch problem, ANNs were created to transform corrupted features towards undistorted ones. After small scale trials that determined the general ANN architecture, it became clear that the features needed to be fed into an individual ANN for each mapped feature desired. Various feature combinations with formants, bandwidths, and pitch were tried with different numbers of hidden nodes and adjustments of ANN training parameters. Through experimentation, we determined near-optimum ANN architectures before total feature mapping and SID experiments were performed.
IV. Results


4.1 Introduction

This chapter provides baseline statistics on within-channel and cross-channel conditions. These are followed by the results of SID with features transformed by ANN channel compensation techniques. A discussion of the results includes explanations for certain feature combinations outperforming others.

4.2 Mixture Models with Baseline Statistics

Reynolds [15] showed that the best method for finding the optimum number of mixtures was empirical. Bishop [1] alludes to this parameter search for optimization regarding GMMs. Therefore, a search was done to determine the appropriate number of mixture components for maximum accuracy. Some initial results demonstrated there was not enough diversity in the data to support the 32 mixtures Reynolds found to be optimum for his use [15]. So we tested GMMs with maximum mixtures of two to 16 (or more as necessary) until a peak and accuracy descent was found.
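
The search procedure described above can be sketched as follows. This is a minimal illustration, not the scripts used in this work: the `evaluate` callback, the step of two, the `patience` stopping rule, and the `max_count` safety cap are all assumptions. The example feeds in the T/T accuracies for four formants, bandwidths, and pitch reported in Table 4.1.

```python
def search_mixture_count(evaluate, start=2, step=2, min_upper=16,
                         patience=2, max_count=64):
    """Try start, start+step, ... and stop once the best score has been
    followed by `patience` non-improving scores, but not before `min_upper`."""
    best_count, best_accuracy = None, float("-inf")
    worse_in_a_row = 0
    history = {}
    count = start
    while count <= max_count:
        accuracy = evaluate(count)
        history[count] = accuracy
        if accuracy > best_accuracy:
            best_count, best_accuracy = count, accuracy
            worse_in_a_row = 0
        else:
            worse_in_a_row += 1
        if count >= min_upper and worse_in_a_row >= patience:
            break
        count += step
    return best_count, best_accuracy, history


# T/T accuracies (%) for four formants, bandwidths and pitch, from Table 4.1.
table_4_1 = {2: 72.6, 4: 86.0, 6: 91.1, 8: 92.3,
             10: 92.3, 12: 91.1, 14: 90.0, 16: 87.5}
best_count, best_accuracy, history = search_mixture_count(table_4_1.__getitem__)
print(best_count, best_accuracy)
```

Ties keep the smaller mixture count, so the sketch reports the first peak it meets.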

4.3  Train  and  Test  on  TIMIT  (T/T) 

As the results of Sambur [8] indicated, the use of four formants and pitch provided respectable SID accuracy rates, with a peak of 82.7% with GMMs of ten mixtures while training and testing on TIMIT. In the tables and charts, f represents formant frequency, b stands for bandwidth, and p represents pitch; the adjacent numbers indicate the particular formant frequency or bandwidth.

It is reasonable to hypothesize from Sambur's work that eliminating one of these good features would decrease accuracy. This is demonstrated by the results when one formant, the fourth, was eliminated, as well as when pitch was not included. With ten mixture components (Table 4.1 and Figure 4.1), accuracy dropped to 50.9% and 69.3%, respectively, with little improvement for other numbers of mixtures. This demonstrates the value of pitch and, especially, the fourth formant. The bandwidths were included in this study and found to be valuable, increasing the within-channel SID accuracy to a peak of 92.3%. Some additional conclusions to be drawn relate to the usefulness of the features with reasonable values of cross-correlation, i.e. formants and bandwidths, to be used in an ANN for mapping of testing feature matrices from NTIMIT.

Table 4.1 SID Accuracy (%) on T/T and T/N for Several Combinations of Features

Number of     fb1fb2fb3fb4p    f1f2f3f4p      f1f2f3f4       f1f2f3p
Mixtures      T/T     T/N      T/T    T/N     T/T    T/N     T/T    T/N
per GMM
    2         72.6    5.9      59.8   6.0     38.0   2.1
    4         86.0    6.3      80.4   9.2     35.0   2.7
    6         91.1    6.8      80.4   6.0     69.9   3.0
    8         92.3    5.9      81.3   7.1     69.9   3.0     52.7   12.2
   10         92.3    6.3      82.7   7.7     69.3   2.4     50.9   11.3
   12         91.1    6.3      79.2   7.1     70.5   2.4     55.7   10.4
   14         90.0    6.5      78.9   7.1     71.7   2.4     54.2   12.2
   16         87.5    5.9      76.2   6.5     71.7   7.4     52.1   11.9
   18                                         70.5
   20                                         68.2
   22                                         66.4
   24                                         64.3


4.4 Train on TIMIT and Test on NTIMIT (T/N)

There is a significant drop in SID accuracy under channel mismatch conditions.

Observing Table 4.1 and Figure 4.2, the best results seem to occur with three formants and pitch; with eight or 14 components, SID accuracy was 12.2%. This seems to verify the need for a good fourth formant and, conversely, the degradation with a poor fourth formant. The degradation is caused by both the upper bandwidth limitations of the channel and the nonlinearities of the microphone [3]. The bandwidths made little difference, but the pitch estimates, in a relative sense, were valuable as a feature under those test conditions. We feel differences of a few percentage points are of no great significance, since SID rates of 10% are of no practical use.
Figure 4.1 T/T with Combinations of Formants, Bandwidths, and Pitch

4.5  Train  on  TIMIT  and  Test  on  Transformed  NTIMIT  Features  (T/ANN) 

The results of the ANNs were promising. See Tables 4.2 through 4.5 and Figures 4.3 through 4.6, where a mapped feature is signified by a tilde on top. Although valuable as a feature, the pitch from the ANNs, both in early small scale attempts and with the full ANNs, did not map well to the desired TIMIT feature elements. The original NTIMIT pitch was, therefore, usually substituted. The use of all nine available features is shown in Tables 4.4 and 4.5 and Figures 4.5 and 4.6. The accuracies each increase about 20% with ANNs with 45 hidden nodes and 15% with 72 nodes, regardless of the number of mixtures in the models, when uncompensated pitch is substituted for the transformed pitch.

Just as the importance of the fourth formant with or without its bandwidth was shown by Sambur, it is also shown by the pre- and post-compensation accuracies. It seems the ANNs are able to reconstruct the missing fourth formant from the available formant structure and the cross-correlation of the features. The peak accuracy for all three transformed formants with uncompensated pitch was 23.5%, while the use of the fourth formant in the ANNs and as a feature increased this combination's rate to 31.0%. In fact, this addition raised the SID rates with all mapped combinations, peaking at 34.2% if all the uncompensated second formants were used instead of the transformed versions.

Figure 4.2 T/N with Combinations of Formants, Bandwidths, and Pitch

The  incorporation  of  bandwidth  as  another  feature  and  ANN  input  boosted  rates 
again,  peaking  at  58.3%  with  45  nodes  and  54.2%  with  72  nodes  when  uncompensated 
pitch  was  the  sole,  unmapped  feature.  Much  of  this  data  also  shows  the  presence  of  the 
fourth  formant  in  the  ANNs  greatly  assists  in  the  reconstructing  of  the  third  and,  to  a 
lesser  extent,  second  formant. 

4.6  Summary 

The use of all features easily outperformed any feature subset when no channel mismatch was involved. The SID rate of 92.3% with four formants, bandwidths, and pitch was excellent, confirming the choice of features as reasonable for SID. All rates plummet to about ten percent or less under cross-channel conditions. Very promising results occurred with certain combinations of transformed features. Generally, the uncompensated pitch provided better data, as the ANNs were apparently not able to transform the pitch well. The importance of the fourth formant was established in several ways: the better uncompensated, cross-channel performance when it was not used, its effect when transformed properly, and its value in reconstructing the third and, to a lesser extent, the second formant. The best results came when all nine features were input to each ANN and all transformed output features were used except pitch; the distorted pitch was substituted because it was closer to the TIMIT target. The SID accuracy peaked at 58.3%, up from 5.9% with no compensation, for eight mixture components.

In Tables 4.2 through 4.5, the four (T/ANN) columns differ in which features are ANN-mapped (marked with a tilde in the original); the tilde marks are not legible in this copy.

Table 4.2 SID Accuracy (%) with Combinations of Compensated and NTIMIT Features for Three Formants and Pitch Using ANNs (45 Hidden Nodes)

Number of                    (T/ANN)   (T/ANN)   (T/ANN)   (T/ANN)
Mixtures      T/T    T/N     f1f2f3p   f1f2f3p   f1f2f3p   f1f2f3p
per GMM
    2         37.5    7.7
    4         48.8    9.8
    6         50.0   11.9
    8         52.7   12.2    21.1      17.6      18.2      14.0
   10         50.9   11.3    23.5      19.0      13.1      14.9
   12         55.7   10.4    21.4      18.4      14.3      14.3
   14         54.2   12.2    22.0      18.2      16.7      12.3
   16         52.1   11.9    22.0      17.6      16.7      14.0

Table 4.3 SID Accuracy (%) with Combinations of Compensated and NTIMIT Features for Four Formants and Pitch Using ANNs (45 Hidden Nodes)

Number of                    (T/ANN)     (T/ANN)     (T/ANN)     (T/ANN)
Mixtures      T/T    T/N     f1f2f3f4p   f1f2f3f4p   f1f2f3f4p   f1f2f3f4p
per GMM
    8         81.3    7.1    29.8        22.0        19.6        33.6
   10         82.7    7.7    31.0        24.4        22.6        34.2
   12         79.2    7.1    29.2        26.2        32.7
   14         78.9    7.1    27.4        25.6        21.4        32.7
   16         76.2    6.5    29.2        23.8        20.2

The T/T and T/N entries for 10 and 12 mixtures and the T/N entry for 14 mixtures are illegible in this copy and are taken from the f1f2f3f4p columns of Table 4.1. For 12 and 16 mixtures only three (T/ANN) values are legible; they are listed in extracted order, so their column positions are uncertain.

Table 4.4 SID Accuracy (%) with Combinations of Compensated and NTIMIT Features for Four Formants, Bandwidths and Pitch Using ANNs (45 H-Nodes)

Number of                    (T/ANN)         (T/ANN)         (T/ANN)         (T/ANN)
Mixtures      T/T    T/N     fb1fb2fb3fb4p   fb1fb2fb3fb4p   fb1fb2fb3fb4p   fb1fb2fb3fb4p
per GMM
    8         92.3    5.9    58.3            38.1            33.3            18.4
   10         92.3    6.3    53.9            35.4            32.1            22.9
   12         91.1    6.3    54.2            35.7            33.6            17.6
   14         90.0    6.5    50.3            32.1            30.4            27.4
   16         87.5    5.9    49.4            32.4            29.2            17.9

The T/T entry for 12 mixtures is illegible in this copy and is taken from Table 4.1.

Table 4.5 SID Accuracy (%) with Combinations of Compensated and NTIMIT Features for Four Formants, Bandwidths and Pitch Using ANNs (72 H-Nodes)

Number of                    (T/ANN)         (T/ANN)         (T/ANN)         (T/ANN)
Mixtures      T/T    T/N     fb1fb2fb3fb4p   fb1fb2fb3fb4p   fb1fb2fb3fb4p   fb1fb2fb3fb4p
per GMM
    8         92.3    5.9    54.2            36.0            36.0            19.0
   10         92.3    6.3    53.0            37.5            35.4            18.4
   12         91.1    6.3    49.7            32.1            31.0            17.6
   14         90.0    6.5    45.2            31.0            29.2            15.2
   16         87.5    5.9    47.6            34.2            31.8            17.3
Figure 4.3 T/ANN (45 Nodes) with Combinations of Compensated and NTIMIT Features for Three Formants and Pitch

Figure 4.4 T/ANN (45 Nodes) with Combinations of Compensated and NTIMIT Features for Four Formants and Pitch

Figure 4.5 T/ANN (45 Nodes) with Combinations of Compensated and NTIMIT Features for Four Formants, Bandwidths and Pitch

Figure 4.6 T/ANN (72 Nodes) with Combinations of Compensated and NTIMIT Features for Four Formants, Bandwidths and Pitch
V. Conclusions


5.1  Compensation  Technique  Development 

Through  early  testing  with  other  compensation  methods  [11],  [21],  previous  research 
[3],  and  spectrum  analysis,  it  became  apparent  to  us  the  chief  cross-channel  SID  problem 
was  the  nonlinear  distortion  imposed  by  the  microphone.  We  felt  a  compensation  technique 
using  ANNs  should  be  developed.  After  many  architectural  attempts,  best  results  were 
found  when  using  all  nine  features  to  train  individual  ANNs  for  mapping  each  feature  for 
each  speaker. 

5.2  Baseline  Testing 

With channel conditions kept consistent between training and testing, good SID accuracies of over 92% were obtained by GMMs with formants, bandwidths, and pitch as features. This result experimentally confirmed our choice of features and our use of GMMs as a classification tool. Telephone equipment and network distortions caused large degradations in the ability to use the previously trained GMMs for SID. When we used ANNs to compensate for the distortion effects, improvements were substantial with various feature combinations. For example, the peak accuracy using features transformed with our technique across all 168 speaker models was 58.3%. This compares favorably to the 5.9% with no channel mismatch compensation and the 27.4% from previous research with different features (MFCCs) and another compensation technique.

5.3  SID  with  Transformed  Features 

The best results for channel compensation came from structuring the ANNs with nine inputs and one output, and then taking all of the transformed output feature data except the pitch. Although good results occurred when using the cleaned pitch, much better results (typically 20% improvement) were obtained when substituting the original NTIMIT pitch values. This improvement was also verified when using the other feature combinations which lacked bandwidth. When comparing various examples of the training inputs and outputs, pitch does not seem to transform well to the ideal targets. Theoretically, this is probably due to the lower cross-correlation between the pitch and the other features.

When  comparing  many  of  the  inverse  covariance  matrices  HTK  2.0  gives  for  the 
GMMs  of  each  speaker,  calculations  show  the  autocorrelation  of  the  pitch  to  typically 
be  the  lowest  value  of  any  element.  The  autocorrelation  values  often  are  two  orders  of 
magnitude  less  than  the  other  autocorrelation  values.  And  the  cross-correlation  values 
with  pitch  are  most  often  among  the  average  or  lower  values  of  the  matrices.  Consistent 
in  a  general  sense  with  Sambur’s  results  [8]  was  the  outcome  that  pitch  was  a  valuable 
feature  and  did  improve  SID. 

The  importance  of  the  fourth  formant  with  or  without  its  bandwidth  was  shown  by 
Sambur,  as  is  also  shown  by  the  pre-  and  post-compensation  accuracies.  It  seems  the  ANNs 
are  able  to  reconstruct  the  missing  fourth  formant  from  the  available  formant  structure 
and  the  cross-correlation  of  the  features.  It  also  seems  the  presence  of  the  fourth  formant 
in  the  ANNs  greatly  assists  in  the  reconstructing  of  the  third  and,  to  a  lesser  extent,  the 
second  formant. 

5.4 Final Thoughts

We  have  demonstrated  relative  improvements  in  cross-channel  SID  accuracy  from  6% 
to  about  60%.  Recognizing  the  best  previously-published  rates  using  other  compensation 
methods  was  near  27%  [13],  this  technique,  theoretically  and  experimentally,  seems  to  hold 
great  promise  for  eventually  solving  the  channel  mismatch  problem. 

5.5  Recommendations  for  Further  Study 

The results proved promising for a new avenue for channel mismatch compensation. Further use of ANN software, with MM-NNT and other types, along with different architectural schemes should be employed. Other features with reasonable cross-correlation, which may assist in the transformation through ANNs, should be examined. A logical, direct follow-on would be to test the transformed features in a speaker verification task. This mapping technique may also be applicable to other fields of research.
VI. Appendix


6.1  Artificial  Neural  Networks  Theory 

6.1.1 Network Classification Theory. The simplest classification problem is the two-class problem with the use of the linear discriminant function, i.e. if $y(\mathbf{x}) > 0$ then classify $\mathbf{x}$ as a member of class one; if less than zero, classify it as a member of class two. The general equation is [5]

$$y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0 \tag{6.1}$$

where $\mathbf{w}$ is a d-dimensional weight vector. The threshold or offset, $w_0$, is often referred to as the bias. Equation 6.1 corresponds to a hyperplane of dimensionality (d - 1). If we set the offset to zero and set two points, $\mathbf{x}_1$ and $\mathbf{x}_2$, on the hyperplane boundary where $y(\mathbf{x}) = 0$, equation 6.1 becomes

$$\mathbf{w}^T (\mathbf{x}_2 - \mathbf{x}_1) = 0. \tag{6.2}$$

Therefore, the weight vector is geometrically normal to any vector contained in the hyperplane. As mentioned, the bias determines the offset position of the hyperplane. Hence, a linear discriminant function can be represented by Figure 6.1, where the input $x_0$ is permanently set to unity. This is often referred to as a single-layer perceptron.


Figure  6.1  Simple  Artificial  Neural  Network 
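
The two-class rule and the normality property can be checked with a small Python sketch. The weights and points below are hypothetical values chosen so that both points lie on the boundary; the bias cancels in the difference of two boundary points, so the weight vector is normal to it.

```python
def discriminant(w, x, w0=0.0):
    """y(x) = w^T x + w0 for plain Python lists."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def classify(w, x, w0=0.0):
    """Class one if y(x) > 0, otherwise class two."""
    return 1 if discriminant(w, x, w0) > 0 else 2


# Hypothetical two-dimensional example.
w, w0 = [2.0, -1.0], -1.0
x1, x2 = [1.0, 1.0], [2.0, 3.0]          # both satisfy y(x) = 0
difference = [b - a for a, b in zip(x1, x2)]
print(discriminant(w, x1, w0), discriminant(w, x2, w0))   # boundary points
print(discriminant(w, difference))                         # w^T (x2 - x1)
print(classify(w, [3.0, 0.0], w0), classify(w, [0.0, 3.0], w0))
```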


6.1.2 Multi-class ANN. An easy extension to multiple classes is performed if each class, class k, has its own discriminant function. The hyperplane decision boundary between classes k and j is

$$(\mathbf{w}_k - \mathbf{w}_j)^T \mathbf{x} + (w_{k0} - w_{j0}) = 0. \tag{6.3}$$

The network from Figure 6.1 has been changed for multiple classes, as demonstrated in Figure 6.2. Each connection has an associated weight. The network outputs can now be expressed by

$$y_k(\mathbf{x}) = \mathbf{w}_k^T \mathbf{x} + w_{k0} \tag{6.4}$$

or

$$y_k(\mathbf{x}) = \sum_{i=1}^{d} w_{ki} x_i + w_{k0}. \tag{6.5}$$


Figure 6.2 Single-Layer, Multiple Output Network


Activation functions may apply a nonlinear distortion to this sum to obtain the desired outputs.

6.1.3 Multilayer Perceptron Theory. We now extend these single-layer networks to multiple-layered networks, where the outputs of the first layer become the inputs of the second. The first-layer units are called hidden layer nodes. Refer to Figure 6.3.

Figure 6.3 Multilayer Perceptron

The hidden node inputs are obtained, as before, by a linear combination of d weighted inputs, along with a bias term,

$$a_j(\mathbf{x}) = \sum_{i=1}^{d} w_{ji}^{(1)} x_i + w_{j0}^{(1)} x_0 = \sum_{i=0}^{d} w_{ji}^{(1)} x_i \tag{6.6}$$

where $w_{ji}^{(1)}$ denotes a first layer weight from input node i to hidden node j. Combining the fact that a is the input the activation functions operate on with the existence of layers of hidden weights, the outputs of the k nodes are

$$y_k(\mathbf{x}) = g^{(2)}\left( \sum_{j} w_{kj}^{(2)} \, g^{(1)}\left( \sum_{i=0}^{d} w_{ji}^{(1)} x_i \right) \right) \tag{6.7}$$

with the numbers in parentheses relating to the assigned layer affiliation. The concept can be extended to multiple layers of weights.
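
A forward pass through such a two-layer network can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the ANN software used in this work: the list-of-rows weight layout (bias first) and the function names are hypothetical, while the tanh hidden layer and logistic sigmoid output layer follow the activation pair reported in this study.

```python
import math


def logistic(a):
    """Logistic sigmoid 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, w1, w2):
    """Two-layer perceptron output.

    w1[j] holds [bias, weights...] for hidden node j;
    w2[k] holds [bias, weights...] for output node k.
    """
    x_ext = [1.0] + list(x)                      # x_0 = 1 carries the bias
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x_ext))) for row in w1]
    z_ext = [1.0] + hidden                       # z_0 = 1 carries the output bias
    return [logistic(sum(w * zj for w, zj in zip(row, z_ext))) for row in w2]


# Hypothetical network: two inputs, two hidden nodes, one output.
w1 = [[0.1, 0.4, -0.2], [-0.3, 0.2, 0.5]]
w2 = [[0.05, -0.6, 0.7]]
print(forward([0.3, -0.5], w1, w2))
```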

6.1.3.1 Nonlinear Activation Functions. Now consider a function, $g(\mathbf{w}, \mathbf{x})$, called an activation function, which operates on the weighted sum. A popular nonlinear function is the logistic sigmoid, $g(x) = (1 + e^{-x})^{-1}$; another is the hyperbolic tangent. Since the updating of the weights for convergence is based on the partial derivatives of the error function, the logistic sigmoid function is useful since its derivative is merely a function of itself, $g'(x) = g(x)[1 - g(x)]$. This leads to easier calculations and weight updates. Although linear functions are used, the logistic sigmoid and, a sibling, the hyperbolic tangent, are good for dealing with nonlinearities. This is evident from the Exclusive-OR data separation problem example in [2], [1]: the problem was solved more readily with a sigmoid-sigmoid or tanh-tanh combination for the activation functions of the first and second layers of weights than with a combination which included one layer of linear functions.
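
The derivative identity for the logistic sigmoid can be verified numerically. The sketch below compares $g(x)[1 - g(x)]$ with a central-difference estimate; the step size and test points are arbitrary choices made for illustration.

```python
import math


def logistic(x):
    """Logistic sigmoid g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def logistic_derivative(x):
    """Derivative written in terms of the function itself: g(x) * (1 - g(x))."""
    g = logistic(x)
    return g * (1.0 - g)


def numerical_derivative(f, x, h=1e-6):
    """Central-difference estimate, used here only as a cross-check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for point in (-2.0, 0.0, 1.5):
    print(point, logistic_derivative(point), numerical_derivative(logistic, point))
```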

6.1.4 ANN parameters. Often with classification tools, training must occur, and, therefore, a measurement for validity must be used. One often used is the Sum-of-Squared Errors (SSE) [2],

$$E = \frac{1}{2} \sum_{k} (d_k - z_k)^2 \tag{6.8}$$

where $d_k$ is the desired output and $z_k$ is the actual output. As adaptation of the weights for optimum performance is desired, an update rule for them must be established with relation to their impact on the SSE. This rule is [2]

$$W_h^{+} = W_h^{-} - \eta \frac{\partial E}{\partial W_h} \tag{6.9}$$

which is equivalent to

$$\Delta W_h = -\eta \frac{\partial E}{\partial W_h} \tag{6.10}$$

all in vector notation, where $W^{+}$ is the updated set of weights, $W^{-}$ is the old set of weights, h is the perceptron layer, and $\eta$ is the learning rate [2]. This formula uses gradient descent, which seeks out the minimum SSE by use of the partial derivative of the error. An adaptive learning rate is best; for example, an $\eta$ which is inversely proportional to the weight index number. This often reduces oscillations in the SSE plot, since smaller corrections are desired as training proceeds towards convergence. Another option is to tie $\eta$ to the difference between the current output and actual target value.

In addition to an $\eta$ term, a momentum term may be used to avoid local minima in error, and thereby find the global minimum error. Momentum, which is proportional to the change in the weight values resulting from the last update, often speeds up convergence [2]. With momentum, the weight update equation becomes

$$\Delta W_h = -\eta \frac{\partial E}{\partial W_h} + \mu \, \Delta W_h^{-} \tag{6.11}$$

where $\Delta W_h^{-}$ is the weight change from the last update and $\mu$ is the momentum rate.
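
The momentum update can be tried on a toy problem. In the minimal Python sketch below, the error is a sum of squares whose gradient is simply the distance from a target; the momentum rate of 0.9 matches the value selected in this study, while the learning rate of 0.1, the target, and the function names are assumptions made for illustration.

```python
def momentum_step(weights, gradient, previous_delta, eta=0.1, mu=0.9):
    """One update: delta = -eta * gradient + mu * previous delta."""
    delta = [-eta * g + mu * d for g, d in zip(gradient, previous_delta)]
    new_weights = [w + d for w, d in zip(weights, delta)]
    return new_weights, delta


def sse_gradient(weights, target):
    """Gradient of E = 1/2 * sum (w - t)^2, which is (w - t)."""
    return [w - t for w, t in zip(weights, target)]


target = [1.0, -2.0]
weights = [0.0, 0.0]
delta = [0.0, 0.0]                 # no previous update at the start
for _ in range(300):
    weights, delta = momentum_step(weights, sse_gradient(weights, target), delta)
print(weights)
```

On this quadratic error the weights oscillate around the target with shrinking amplitude, which is the behavior the momentum term trades for faster travel along shallow directions.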


6.1.5 Backpropagation. Error backpropagation is the process where the partial derivatives of the ANN output errors with regard to the second layer of weights are passed back to evaluate the partial derivatives of the outputs of each hidden layer node with regard to the first layer of weights. It is due to backpropagation that updating weights becomes more efficient [1]. Again realizing the output of the activation function is

$$z_j = g(a_j) = g\left( \sum_{i} w_{ji} z_i \right) \tag{6.12}$$

the drive is to minimize the error in relation to the weights; the weight $w_{ji}$ connects node i to node j, and $\delta_j \equiv \partial E^n / \partial a_j$ refers to the errors of the hidden units relating to inputs. The errors of the network output units are the same, merely substituting k for j as layer indicator.

As the SSE metric demonstrates, the error of an ANN is a differentiable function of the output variables. For backpropagation, the error relationship of network weights to be minimized is found by combining equations,

$$\frac{\partial E^n}{\partial w_{ji}} = \frac{\partial E^n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}} = \delta_j z_i. \tag{6.13}$$

From the chain rule, the input error relationships for the output and hidden nodes are, respectively,

$$\frac{\partial E^n}{\partial a_k} = g'(a_k) \frac{\partial E^n}{\partial y_k} \tag{6.14}$$

and

$$\frac{\partial E^n}{\partial a_j} = \sum_{k} \frac{\partial E^n}{\partial a_k} \frac{\partial a_k}{\partial a_j}. \tag{6.15}$$

Note in the latter equation that the sum encompasses all the outputs k which are connected to j. Note also that this equation demonstrates how variations in $a_j$, the weighted sum of network inputs, impact the $a_k$ variables, the weighted sums of the upper layer inputs. Combining equations, the hidden unit input errors can be calculated by propagating the input errors backward from the output nodes,

$$\delta_j = g'(a_j) \sum_{k} w_{kj} \delta_k. \tag{6.16}$$
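
The backward pass can be sketched for a two-layer network and checked against finite differences. This is a minimal Python illustration, not the training software used in this work: the weight layout (bias first in each row), the example numbers, and the helper names are assumptions; the tanh hidden layer, logistic output layer, and SSE error follow the text.

```python
import math


def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, w1, w2):
    """Return extended inputs, hidden sums, extended hidden outputs, outputs."""
    x_ext = [1.0] + list(x)
    a_hidden = [sum(w * xi for w, xi in zip(row, x_ext)) for row in w1]
    z_ext = [1.0] + [math.tanh(a) for a in a_hidden]
    y = [logistic(sum(w * zj for w, zj in zip(row, z_ext))) for row in w2]
    return x_ext, a_hidden, z_ext, y


def sse(y, d):
    return 0.5 * sum((dk - yk) ** 2 for dk, yk in zip(d, y))


def backprop(x, d, w1, w2):
    """Gradients of the SSE with respect to both weight layers."""
    x_ext, a_hidden, z_ext, y = forward(x, w1, w2)
    # Output errors: g'(a_k) * dE/dy_k, with g' = y(1 - y) and dE/dy_k = y_k - d_k.
    delta_out = [yk * (1.0 - yk) * (yk - dk) for yk, dk in zip(y, d)]
    # Hidden errors: g'(a_j) * sum_k w_kj * delta_k, with tanh' = 1 - tanh^2.
    delta_hidden = []
    for j, a in enumerate(a_hidden):
        back = sum(w2[k][j + 1] * delta_out[k] for k in range(len(w2)))
        delta_hidden.append((1.0 - math.tanh(a) ** 2) * back)
    grad_w1 = [[dj * xi for xi in x_ext] for dj in delta_hidden]
    grad_w2 = [[dk * zj for zj in z_ext] for dk in delta_out]
    return grad_w1, grad_w2


def numerical_gradient(x, d, w1, w2, layer, row, col, h=1e-6):
    """Central-difference estimate of dE/dw for one weight (check only)."""
    def error_with(shift):
        w1_copy = [r[:] for r in w1]
        w2_copy = [r[:] for r in w2]
        target = w1_copy if layer == 1 else w2_copy
        target[row][col] += shift
        return sse(forward(x, w1_copy, w2_copy)[3], d)
    return (error_with(h) - error_with(-h)) / (2.0 * h)


# Hypothetical network and training pair.
x, d = [0.3, -0.5], [0.8]
w1 = [[0.1, 0.4, -0.2], [-0.3, 0.2, 0.5]]
w2 = [[0.05, -0.6, 0.7]]
grad_w1, grad_w2 = backprop(x, d, w1, w2)
print(grad_w1)
print(grad_w2)
```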


6.2  Gaussian  Mixture  Model  Classification  Theory 

6.2.1 Semiparametric Classification Theory. The probability density function of mixture models is formed from a linear combination of basis functions,

$$p(\mathbf{x}) = \sum_{j=1}^{M} p(\mathbf{x} \mid j) P(j). \tag{6.17}$$

To demonstrate the use of Gaussian Mixture Models (GMMs) and the optimization algorithm, the connection with posterior probabilities is investigated via Bayes' theorem [1],

$$P(j \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid j) P(j)}{p(\mathbf{x})}. \tag{6.18}$$

The value of $P(j \mid \mathbf{x})$ represents the probability that component j was responsible for creating the data point $\mathbf{x}$. The negative log-likelihood, a type of error function, is given as [1]

$$E = -\ln \mathcal{L} = -\sum_{n=1}^{N} \ln p(\mathbf{x}^n) = -\sum_{n=1}^{N} \ln \left\{ \sum_{j=1}^{M} p(\mathbf{x}^n \mid j) P(j) \right\}. \tag{6.19}$$

The error is minimized by maximizing the likelihood score, i.e. the Maximum A Posteriori Probability (MAP).
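
These three quantities can be computed directly. The sketch below is a one-dimensional Python illustration with made-up parameters, not the HTK models used in this work; `identify` is a hypothetical helper that picks the speaker model with the lowest error E for a set of test frames.

```python
import math


def gaussian(x, mean, variance):
    """One-dimensional Gaussian density p(x | j)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def mixture_density(x, priors, means, variances):
    """p(x) = sum_j p(x | j) P(j)."""
    return sum(P * gaussian(x, m, v) for P, m, v in zip(priors, means, variances))


def posteriors(x, priors, means, variances):
    """P(j | x) from Bayes' theorem."""
    px = mixture_density(x, priors, means, variances)
    return [P * gaussian(x, m, v) / px for P, m, v in zip(priors, means, variances)]


def negative_log_likelihood(data, priors, means, variances):
    """E = -sum_n ln p(x^n)."""
    return -sum(math.log(mixture_density(x, priors, means, variances)) for x in data)


def identify(data, models):
    """Return the name of the model with the lowest error E on `data`."""
    return min(models, key=lambda name: negative_log_likelihood(data, *models[name]))


# Two hypothetical one-dimensional speaker models: (priors, means, variances).
models = {
    "speaker_a": ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]),
    "speaker_b": ([0.3, 0.7], [4.0, 6.0], [1.0, 1.0]),
}
test_frames = [0.2, -0.7, 1.1, 0.4]
print(posteriors(0.0, *models["speaker_a"]))
print(identify(test_frames, models))
```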


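6.2.2 Expectation-Maximization Theory for Solving the Mixture Model Parameters. Given the adjustable parameters of this formula are the mean, variance and prior probability, the desire is to minimize this error function in relation to these parameters. Partially differentiating E with respect to each parameter, we get

$$\frac{\partial E}{\partial \boldsymbol{\mu}_j} = \sum_{n=1}^{N} P(j \mid \mathbf{x}^n) \frac{(\boldsymbol{\mu}_j - \mathbf{x}^n)}{\sigma_j^2}, \tag{6.20}$$

$$\frac{\partial E}{\partial \sigma_j} = \sum_{n=1}^{N} P(j \mid \mathbf{x}^n) \left\{ \frac{d}{\sigma_j} - \frac{\| \mathbf{x}^n - \boldsymbol{\mu}_j \|^2}{\sigma_j^3} \right\}, \tag{6.21}$$

and

$$\frac{\partial E}{\partial \gamma_j} = -\sum_{n=1}^{N} \left\{ P(j \mid \mathbf{x}^n) - P(j) \right\}, \tag{6.22}$$

where

$$P(j) = \frac{\exp(\gamma_j)}{\sum_{k=1}^{M} \exp(\gamma_k)}. \tag{6.23}$$

Setting the derivatives equal to zero and solving yields the highly nonlinear trinity of equations

$$\hat{\boldsymbol{\mu}}_j = \frac{\sum_{n=1}^{N} P(j \mid \mathbf{x}^n) \, \mathbf{x}^n}{\sum_{n=1}^{N} P(j \mid \mathbf{x}^n)}, \tag{6.24}$$

$$\hat{\sigma}_j^2 = \frac{1}{d} \, \frac{\sum_{n=1}^{N} P(j \mid \mathbf{x}^n) \, \| \mathbf{x}^n - \hat{\boldsymbol{\mu}}_j \|^2}{\sum_{n=1}^{N} P(j \mid \mathbf{x}^n)}, \tag{6.25}$$

and

$$\hat{P}(j) = \frac{1}{N} \sum_{n=1}^{N} P(j \mid \mathbf{x}^n), \tag{6.26}$$

which can only be solved in a practical sense by an iterative process. Baum-Welch [5] devised a way to better optimize through an iterative process by more carefully calculating updates to the iterations. The error function is manipulated into an updating process,

$$E_{new} - E_{old} = -\sum_{n=1}^{N} \ln \left\{ \frac{p_{new}(\mathbf{x}^n)}{p_{old}(\mathbf{x}^n)} \right\}. \tag{6.27}$$

The resulting updatable parameter equations are

$$\boldsymbol{\mu}_{j,new} = \frac{\sum_{n=1}^{N} P_{old}(j \mid \mathbf{x}^n) \, \mathbf{x}^n}{\sum_{n=1}^{N} P_{old}(j \mid \mathbf{x}^n)}, \tag{6.28}$$

$$(\sigma_{j,new})^2 = \frac{1}{d} \, \frac{\sum_{n=1}^{N} P_{old}(j \mid \mathbf{x}^n) \, \| \mathbf{x}^n - \boldsymbol{\mu}_{j,new} \|^2}{\sum_{n=1}^{N} P_{old}(j \mid \mathbf{x}^n)}, \tag{6.29}$$

and

$$P(j)_{new} = \frac{1}{N} \sum_{n=1}^{N} P_{old}(j \mid \mathbf{x}^n). \tag{6.30}$$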
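
The update equations can be exercised on a small data set. The following Python sketch runs the iteration for a one-dimensional mixture (d = 1) on made-up data with two clusters; it is an illustration only, not the HTK re-estimation used in this work, and the data, starting values, and function names are assumptions.

```python
import math


def gaussian(x, mean, variance):
    return math.exp(-(x - mean) ** 2 / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def error(data, priors, means, variances):
    """E = -sum_n ln p(x^n)."""
    return -sum(
        math.log(sum(P * gaussian(x, m, v) for P, m, v in zip(priors, means, variances)))
        for x in data
    )


def em_step(data, priors, means, variances):
    """One EM update for a one-dimensional mixture (d = 1)."""
    n_components = len(priors)
    # E-step: responsibilities P_old(j | x^n) from the current parameters.
    responsibilities = []
    for x in data:
        joint = [priors[j] * gaussian(x, means[j], variances[j]) for j in range(n_components)]
        total = sum(joint)
        responsibilities.append([value / total for value in joint])
    # M-step: re-estimate mean, variance and prior of each component.
    new_priors, new_means, new_variances = [], [], []
    for j in range(n_components):
        weight = sum(r[j] for r in responsibilities)
        mean = sum(r[j] * x for r, x in zip(responsibilities, data)) / weight
        variance = sum(r[j] * (x - mean) ** 2 for r, x in zip(responsibilities, data)) / weight
        new_priors.append(weight / len(data))
        new_means.append(mean)
        new_variances.append(variance)
    return new_priors, new_means, new_variances


# Made-up data with clusters near -2 and 3.
data = [-2.1, -1.9, -2.0, -1.8, -2.2, 2.9, 3.1, 3.0, 3.2, 2.8]
priors, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
errors = [error(data, priors, means, variances)]
for _ in range(20):
    priors, means, variances = em_step(data, priors, means, variances)
    errors.append(error(data, priors, means, variances))
print(means, variances, priors)
```

Recording E after every step makes it easy to confirm that the error never increases from one iteration to the next.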




% fbf02htkgfbpu.m
% Edmund Fitzgerald 22SEP97, some lines from Arb's file fbf02htkfpuo2.m
% Takes default feature vectors (unvoiced) of the .fb files, combines 4-formants
% and 4-bandwidths into single feature vectors of 8 features and strips
% unvoiced records from them, via finding ESPS-imposed defaults for unvoiced;
% adds pitch also

clear all
close all
filenames=[];
indexunvs=[];
filecounter=1;

% listing of get_f0 files
[fid1,message]=fopen(['/home/fugglesl/efitzger/toy/timit/test/dr5/total.gf0'],'r');

difflengs=[];
done=0;
counter=0;

while ~done
  nextline=fgetl(fid1);
  if ~isstr(nextline)  % at end of path list
    done=1;
  else  % not at end of path list
    fborf0=fliplr(abs(nextline));
    if fborf0(1)==48  % last character is '0' (ASCII 48): an .f0 file
      flag=0;
      flagnum=abs(flag);
    else
      flag=1;
      flagnum=abs(flag);
    end


    % now behead and copy back w/ new extension; but work on tmpfeatgfbpudr5.fbx
    cd /home/fugglesl/efitzger/toy/timit/test/dr5;
    eval(['!cp ',nextline,' /home/fugglesl/efitzger/toy/timit/test/dr5/tmpfeatgfbpudr5.fb2']);
    eval(['!bhd tmpfeatgfbpudr5.fb2 tmpfeatgfbpudr5.fbx']);  % data to tmpfeatgfbpudr5.fbx
    cd /home/fugglesl/efitzger/toy/timit/test/dr5
    if flag==1
      flag
      % pull in beheaded file to put into new matrix
      [fid2,message]=fopen('tmpfeatgfbpudr5.fbx','r','b');
      A=fread(fid2,inf,'float64');
      fclose(fid2);
      counter=counter+1;
      Arows=size(A,1);
      Brows=fix(Arows/8);
      trash=Arows-(Brows*8);
      A=A(1:Arows-trash,1);
      B=reshape(A,Brows,8);  % now the matrix is 8 (featgfbpu) by whatever
      %FMBW=B'; the mat2fea needs x by 8 matrices
      C=reshape(B,8,Brows);
      FMBWI=C';
      % Now change order for easy missing features: f2f3bw2bw3f1bw1f4bw4
      FMBWX=[];
      FMBWX=FMBWI';  % 8 by x
      FMBW=[FMBWX(2,:);FMBWX(6,:);FMBWX(3,:);FMBWX(7,:);FMBWX(1,:);FMBWX(5,:);FMBWX(4,:);...
            FMBWX(8,:)];
      %F2B2F3B3F1B1F4B4
      %%% Not till F0 incorporated FMBW=FMBW'; get rid of it use nnz to reshape again for size

      stringfile=abs(nextline);
      for index=1:length(stringfile)
        if stringfile(index)==47  % '/' is ASCII 47 ('s' is 115)
          if stringfile(index+1)==115  % 's' follows '/'; '.' is ASCII 46
            if (length(stringfile)==65) & (stringfile(index+4)==46)  % .
              filename(1:3)=stringfile(index+1:index+3);  % sa1 obtained, what about sa3456
            elseif (length(stringfile)==66) & (stringfile(index+5)==46)  % .
              filename(1:4)=stringfile(index+1:index+4);  % sa1 obtained, what about sa3456
            elseif (length(stringfile)==67) & (stringfile(index+6)==46)  % .
              filename(1:5)=stringfile(index+1:index+5);  % sa1 obtained, what about sa3456
            elseif (length(stringfile)==68) & (stringfile(index+7)==46)  % .
              filename(1:6)=stringfile(index+1:index+6);  % sa1 obtained, what about sa3456
            end  % elseif
          end  % 115
        end  % 47
      end  % for index varname=[filename flagnum];

      filenameB=setstr(filename);
      save /home/fugglesl/efitzger/dr5featgfbpu FMBW filenameB
    end  % flag==1

    if flag==0
      load /home/fugglesl/efitzger/dr5featgfbpu  % takes in the 4fms and 4bws
      adjust=0;  % now obtain pitch and prob of voicing
      eval(['!fea2mat -f F0 ',nextline,' /home/fugglesl/efitzger/toy/timit/test/',...
            'tmpfeatgfbpudr51.mat']);
      eval(['!fea2mat -f prob_voice ',nextline,' /home/fugglesl/efitzger/toy/timit/test/',...
            'tmpfeatgfbpudr52.mat']);
      load /home/fugglesl/efitzger/toy/timit/test/tmpfeatgfbpudr51
      load /home/fugglesl/efitzger/toy/timit/test/tmpfeatgfbpudr52
      strayzero=0;

      for nonzeroloop1=1:size(FMBW,1)
        for nonzeroloop2=1:size(FMBW,2)
          if FMBW(nonzeroloop1,nonzeroloop2)==0
            % eliminates stray zeros so no nnz confusion w/Pr<0.5
            FMBW(nonzeroloop1,nonzeroloop2)=0.01
            strayzero=strayzero+1;
          end  % if
        end  % for nonzeroloop2
      end  % for nonzeroloop1

      diffleng=length(F0)-max(size(FMBW))
      F0=F0';
      if length(F0)>max(size(FMBW))
        F0=F0(1,1:max(size(FMBW)));  % in case of a one frame discrepancy; v & unv regions ok
      end  % since 10ms overlap of 20ms frames
      if length(F0)<max(size(FMBW))
        FMBW=FMBW(:,1:length(F0));  % FMBW already made into matrix from first time through
      end
      % Now incorporating pitch into matrix
      FMBW0=[]; lengF0=length(F0);
      [max(size(FMBW)) length(F0) 1];
      counterzero=0; tolerout=0;
      FMBW0=[FMBW(1,:);FMBW(2,:);FMBW(3,:);FMBW(4,:);F0;FMBW(5,:);FMBW(6,:);FMBW(7,:);...
             FMBW(8,:)];
      FMBW=FMBW0';  % Now ready to output

toler1=mean(FMBW(:,1))+2*std(FMBW(:,1));toler1n=mean(FMBW(:,1))-2*std(FMBW(:,1));
toler2=mean(FMBW(:,2))+2*std(FMBW(:,2));toler2n=mean(FMBW(:,2))-2*std(FMBW(:,2));
toler3=mean(FMBW(:,3))+2*std(FMBW(:,3));toler3n=mean(FMBW(:,3))-2*std(FMBW(:,3));
toler4=mean(FMBW(:,4))+2*std(FMBW(:,4));toler4n=mean(FMBW(:,4))-2*std(FMBW(:,4));
toler5=mean(FMBW(:,5))+2*std(FMBW(:,5));toler5n=mean(FMBW(:,5))-2*std(FMBW(:,5));
for indexrec=1:length(F0)
% NOT length(prob_voice) since F0 might have been corrected by one frame=FMBW leng
if prob_voice(indexrec)<0.5 %%% ANY of the following will zero out a feature vector
counterzero=counterzero+1;
FMBW(indexrec,:)=zeros(1,9);
elseif FMBW(indexrec,1)>toler1 | FMBW(indexrec,2)>toler2
tolerout=tolerout+1;
FMBW(indexrec,:)=zeros(1,9);
elseif FMBW(indexrec,2)==1000 & FMBW(indexrec,4)==1000
tolerout=tolerout+1;
FMBW(indexrec,:)=zeros(1,9);
elseif FMBW(indexrec,3)>toler3 | FMBW(indexrec,4)>toler4 | FMBW(indexrec,5)>toler5
tolerout=tolerout+1;
FMBW(indexrec,:)=zeros(1,9);
elseif FMBW(indexrec,1)<toler1n | FMBW(indexrec,2)<toler2n | ...
FMBW(indexrec,3)<toler3n
tolerout=tolerout+1;
FMBW(indexrec,:)=zeros(1,9);
elseif FMBW(indexrec,4)<toler4n | FMBW(indexrec,5)<toler5n
tolerout=tolerout+1;
FMBW(indexrec,:)=zeros(1,9);
end % if
end % for indexrec

% sizeofNZideal=9*(length(F0)-counterzero);
FMBWNZ=nonzeros(FMBW); % check for stray zero
[max(size(FMBWNZ))/9 length(F0)-counterzero-tolerout counterzero 2] % check sizes
if flag~=2
FMBWP=reshape(FMBWNZ,length(F0)-counterzero-tolerout,9); % New matrix
FMBWP=FMBWP';

% since .gf0 vs .fb or .f0 (2 char)
stringfile=abs(nextline);
stringfile=stringfile(1:(length(stringfile)-1));
setstr(stringfile)
for index=1:length(stringfile)
if stringfile(index)==47 % s=115
if stringfile(index+1)==115 % /=47 .=46
if (length(stringfile)==65) & (stringfile(index+4)==46) % .
filename(1:3)=stringfile(index+1:index+3); % sa1 obtained, what about sa3456
elseif (length(stringfile)==66) & stringfile(index+5)==46 % .
filename(1:4)=stringfile(index+1:index+4); % sa1 obtained, what about sa3456
elseif (length(stringfile)==67) & stringfile(index+6)==46 % .
filename(1:5)=stringfile(index+1:index+5); % sa1 obtained, what about sa3456
elseif (length(stringfile)==68) & stringfile(index+7)==46 % .
filename(1:6)=stringfile(index+1:index+6); % sa1 obtained, what about sa3456
end % elseif
% if (length(stringfile)==69) & (stringfile(index+7)==46) % .
% other order did not work
end % 115
end % 47
end % for index

varname=[filename flagnum];
filenameA=setstr(filename);
absfile=[39,abs(nextline(1:58)),abs('/featgfbpu/'),abs(filenameA),46,104,116,50,39];
filename=setstr(absfile)
filenameA % check
filenameB
if filenameA==filenameB
% don't write out until F0 (pitch) taken in
% Keep all variables in a dialect region, then cat those F0, FB of the same spkr
eval((['!mkdir ',nextline(1:58),'/featgfbpu']));
eval((['!chmod -fR g+w,g+x ',nextline(1:58),'/featgfbpu']));

cd /home/hawkeye19/97d/efitzger/thesissum/SID/makefbwp/fbf02htk
numfeat=min(size(FMBWP));
FMBWMAT=FMBWP;
FMBWT=FMBWP;
FMBWQ=reshape(FMBWT,max(size(FMBWT))*min(size(FMBWT)),1);
FMBWP=reshape(FMBWQ,max(size(FMBWP)),min(size(FMBWP)));
% Developed empirical method ("Plaid-Shirt Method") to put into HTK 2.0 format
eval(['w_error=whtkparm(FMBWP,',filename,');'])
% eval(['w_error=alwrthtkwav(FMBWP,',filename,',numfeat);'])
if w_error==-1
w_error
flag=2
end

if flag~=2
[max(size(FMBWI)) max(size(F0)) max(size(FMBWP))]
FMBWT=FMBWP';
eval((['save ',nextline(1:58),'/featgfbpu/',filenameB,' FMBWMAT adjust strayzero ',...
'tolerout;'])); % counter is simple distinguishment
end % if flag~=2
end % if filename==filenameB
elseif flag==2
end % if flag~=2
end % if F0
difflengs=[difflengs;diffleng];
clear FMBW FMBWP F0 FMBWX FMBW0 filenameB filename FMsize % No confusion with other FMBW F0 data
end % isstr
end % while loop
difflengs'
status=fclose(fid1);
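
The frame-selection step in the listing above zeroes out frames that are unvoiced (voicing probability below 0.5) or whose first five features fall outside the mean plus or minus two standard deviations, then removes them with nonzeros and reshape. A minimal sketch of the same selection with a logical mask follows. It is an illustration under stated assumptions, not the code used in this work: the toy values and the names X, pv, keep, and Xkeep are hypothetical, the column order (four formants, pitch, four bandwidths) is presumed from the listing, and the listing's extra check for entries equal to 1000 is left out.

```matlab
% Toy data (hypothetical values, not taken from TIMIT or NTIMIT):
% 10 frames x 9 features, presumed order F1..F4, F0, B1..B4.
base = [500 1500 2500 3500 120 60 90 120 150];
jitter = mod((1:10)', 3) - 1;              % deterministic -1/0/+1 perturbation
X = ones(10, 1) * base + jitter * ones(1, 9);
X(4, 2) = 9000;                            % implausible F2: should be rejected
pv = 0.9 * ones(10, 1);                    % per-frame voicing probability
pv(7) = 0.2;                               % unvoiced frame: should be rejected

cols = 1:5;                                % the listing thresholds columns 1..5
mu = mean(X(:, cols), 1);                  % statistics over all frames,
sd = std(X(:, cols), 0, 1);                % computed before any rejection
n = size(X, 1);
hi = ones(n, 1) * (mu + 2 * sd);           % per-frame copy of the upper limits
lo = ones(n, 1) * (mu - 2 * sd);           % per-frame copy of the lower limits
inRange = all(X(:, cols) <= hi & X(:, cols) >= lo, 2);
keep = (pv >= 0.5) & inRange;              % logical mask of retained frames
Xkeep = X(keep, :);                        % retained frames, one per row
disp(find(~keep)')                         % indices of the rejected frames
```

Because the mask never relies on zero entries to mark rejected frames, this variant does not need the 0.01 substitution for stray zeros.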


nnetpx921inov13e2.m 1nov97
% Captain Edmund Fitzgerald

% This is for nine input, one output mapping ANN generation using training data
% and mapping of the corresponding test features. Would be difficult to ensure
% correct ANN with corresponding test utterances if separated into two programs.

%

close  all 
clear  all 
flops (0) 
tic 

hidcounter=0;ptrnmatch=0;ptrnmiss=0;ptstmatch=0;ptstmiss=0;
flopnum=[];T=[];P=[];totrand=75;
convergetry=0;fmiterx=0

[fid1,message]=fopen(['/home/fuggles1/efitzger/toy/ntimit/test/dr1/totalfbptnt.tst'],'r')
% list of test feature file paths

done=0;
tstcounter=1;
while ~done
spkrtemp=fgetl(fid1)


if ~isstr(spkrtemp) % at end of path list
done=1;
else % not at end of path list
% strip off speaker name for variable attachment
ntimortim=(abs(spkrtemp));
% for iter=1:length(ntimortim)
if ntimortim(29)==110 % 'n', 116=t
flag=0; %;
flagnum=abs(flag);
spkr=spkrtemp(45:49) % if ntimit
else
flag=1
flagnum=abs(flag);
spkr=spkrtemp(44:48) % if timit
end

if  flag==0 

%  Will  bring  in  2  tst  sentences  for  conversion  with  the  trained  NNet 

if  tstcounter==l 

spkrtemp1=spkrtemp; % hold first test sentence
% Now generate nnet..only do once/spkr/feature
% The training data is combined, not the test data, i.e. test individual utters
eval(['load /home/fuggles1/efitzger/toy/ntimit/test/dr1/',spkr,'/FMBWPNtrn']);
% loading NTIMIT training data, i.e. formants, bandwidths, and pitch (trn)
FMBWPNs=FMBWPNs';
% FMBWPNs=[FMBWPNs(1,:);FMBWPNs(2,:);FMBWPNs(4,:)];

P=(1/10000).*FMBWPNs; % Need to divide data by constant to be input to nonlinear activation functions.
eval(['load /home/fuggles1/efitzger/toy/timit/test/dr1/',spkr,'/FMBWPTtrn'])
% loading corresponding TIMIT data
FMBWPTs=FMBWPTs';
F1ts=FMBWPTs(1,:);F2ts=FMBWPTs(2,:);F3ts=FMBWPTs(3,:);
F4ts=FMBWPTs(4,:);F5ts=FMBWPTs(5,:);F6ts=FMBWPTs(6,:);
F7ts=FMBWPTs(7,:);F8ts=FMBWPTs(8,:);F9ts=FMBWPTs(9,:);
for fmiter=1:9 % need an end later
eval(['Fts=F',int2str(fmiter),'ts;']);
T=(1/10000).*Fts;
F1='tansig';F2='logsig'; % activation functions
hidvctr=[45]; % Used 45 hidden layer nodes
maxcount=max(size(hidvctr));

% ANN parameters
for hidcounter=1:maxcount % use same randn data in exter fcn 10x, rtrn
trnmsclsfy=0;tstmsclsfy=0;trnmatch=0;tstmatch=0;
% note learnbpm.m is contained in trainbpx.m/tbpx3.m
dsplyepochs=50; maxepochs=300; sse=0.01; lr=0.01;
lrinc=1.05; lrdec=0.5; mo=0.90;maxerr=1.04;
% defaults: dsplyepochs=25; maxepochs=100; sse=0.02; learnrate=0.01;
% lrinc=1.05; lrdec=0.7; momentum=0.9;a)0.7,0.96; b)0.5,0.96 maxerr=1.04;
tp=[dsplyepochs,maxepochs,sse,lr,lrinc,lrdec,mo,maxerr];
hidden=hidvctr(hidcounter);
[W1,b1,W2,b2]=initff(P,hidden,F1,min(size(T)),F2);
[W1,b1,W2,b2,te,tr]=trainbpx(W1,b1,F1,W2,b2,F2,P,T,tp);

% ANN trained using training data
format bank
['dsplyepochs ' 'maxepochs ' 'sse ' 'lr ' 'lrinc ' 'lrdec ' 'mo ' 'maxerr ' 'hidden-nodes']
% parameters=[dsplyepochs, maxepochs, sse, lr, lrinc, lrdec, mo, maxerr, F1, F2]
[tp, hidvctr(hidcounter)]
filenamex='pxnov13_dr1zde2'
eval(['save /home/hawkeye19/97d/efitzger/thesissum/SID/makefbwp/makegmm/gmmfpu/nnet',...
'/netwtsnov13/pxnov13_dr1zde2',num2str(fmiter),spkr,num2str(hidden),...
num2str(maxepochs),' W1 b1 W2 b2 te tr tp flopnum']);
% move down: end % of hidcounter loop
filenamex='pxnov13_dr1zde2'

% Even number times of bringing in files, skipped over making nnet..go to (.ht2) o/p...
spkrtemp % log file check
eval(['load ',spkrtemp]); % combine tim and ntim separately, but ensure same length
FMBWPT=FMBWPT./10000; % TIMIT test formants, bandwidths and pitch; divide by const
FMBWPN=FMBWPN./10000; % NTIMIT test formants, bandwidths and pitch from NTIMIT...div
p=FMBWPN;
t=simuff(p,W1,b1,F1,W2,b2,F2); % now map the test data using the trained ANN
eval(['Nop',int2str(fmiter),'=t;']);
meanNop1=10000.*mean(Nop1)
if meanNop1<100 % This is decision area; if true then test data transformed badly so
% ...try the ANN generation again (using training data)
% The following loop ensures a convergence to reasonable SSE
fmiterx=fmiter %%
convergetry=convergetry+1


while fmiterx==fmiter % go through nnet generation until meanNop>100
%%%%%%
T=(1/10000).*Fts;
F1='tansig';F2='logsig';
hidvctra=[45];
% use same randn data in exter ANN generation fcn 10x, return
% note learnbpm.m is contained in trainbpx.m/tbpx3.m
dsplyepochs=50; maxepochs=300; sse=0.01; lr=0.01;
lrinc=1.05; lrdec=0.5; mo=0.90;maxerr=1.04;
% defaults: dsplyepochs=25; maxepochs=100; sse=0.02; learnrate=0.01;
% lrinc=1.05; lrdec=0.7; momentum=0.9;a)0.7,0.96; b)0.5,0.96 maxerr=1.04;
tpa=[dsplyepochs,maxepochs,sse,lr,lrinc,lrdec,mo,maxerr];
hiddena=hidvctra;
[W1,b1,W2,b2]=initff(P,hiddena,F1,min(size(T)),F2);
[W1,b1,W2,b2,te,tr]=trainbpx(W1,b1,F1,W2,b2,F2,P,T,tpa);
format bank

['dsplyepochs ' 'maxepochs ' 'sse ' 'lr ' 'lrinc ' 'lrdec ' 'mo ' 'maxerr ' 'hidden-nodes'] % check progress in log file
% parameters=[dsplyepochs, maxepochs, sse, lr, lrinc, lrdec, mo, maxerr, F1, F2]
[tpa, hidvctra] % check in log file
filenamex='pxnov13_dr1zde2'
eval(['save /home/hawkeye19/97d/efitzger/thesissum/SID/makefbwp/makegmm/gmmfpu',...
'/nnet/netwtsnov13/pxnov13_dr1zde2',num2str(fmiter),spkr,num2str(hidden),...
num2str(maxepochs),' W1 b1 W2 b2 te tr tp flopnum']);
%%% move down: end % of hidcounter loop
filenamex='pxnov13_dr1zde2'

% Even number times of bringing in files, skipped over making nnet..go to .ht2 o/p...
% still have if flag==0
spkrtemp
eval(['load ',spkrtemp]); % combine tim and ntim separately, but ensure same leng
FMBWPT=FMBWPT./10000;
FMBWPN=FMBWPN./10000;
p=FMBWPN;
t=simuff(p,W1,b1,F1,W2,b2,F2); % we just want #s! conf matrix
eval(['Nop',int2str(fmiter),'=t;']);
eval(['meanNopx=10000.*mean(Nop',int2str(fmiter),');']);
if meanNopx<100 % junk data
fmiterx=fmiter; % instead of 1
else
fmiterx=fmiter+1 % instead of 0
end % if four lines up
end % while fmiterx=1


%%%%%%%%%%%%%% End of convergence-insurance loop
else
fmiterx=fmiter+1
convergetry=0
end % if 3 lines up/mean<100
% Now can format and output
end % fmiter
end % of hidcounter loop
end % if tstcounter==1


% Combinations of transformed and original NTIMIT features
fm1=[Nop1;Nop2;Nop3;Nop4;FMBWPN(5,:);Nop6;Nop7;Nop8;Nop9];
fm2=[Nop1;Nop2;Nop3;Nop4;Nop5;Nop6;Nop7;Nop8;Nop9]; % how get pitch in here...unchanged pitch
fm3=[FMBWPN(1,:);FMBWPN(2,:);Nop3;Nop4;Nop5;Nop6;Nop7;Nop8;Nop9];
fm4=[Nop1;Nop2;FMBWPN(3,:);FMBWPN(4,:);Nop5;Nop6;Nop7;Nop8;Nop9];
fm5=[Nop1;Nop2;Nop3;Nop4;Nop5;Nop6;Nop7;FMBWPN(8,:);FMBWPN(9,:)];
fm6=[Nop1;Nop2;Nop3;Nop4;Nop5;FMBWPN(6,:);FMBWPN(7,:);Nop8;Nop9];
fm7=[FMBWPN(1,:);FMBWPN(2,:);Nop3;Nop4;FMBWPN(5,:);Nop6;Nop7;Nop8;Nop9];
fm8=[Nop1;Nop2;FMBWPN(3,:);FMBWPN(4,:);FMBWPN(5,:);Nop6;Nop7;Nop8;Nop9];

%%%% keyboard
FMBWPNtoT1=10000.*fm1;
FMBWPNtoT2=10000.*fm2;
FMBWPNtoT3=10000.*fm3;
FMBWPNtoT4=10000.*fm4;
FMBWPNtoT5=10000.*fm5;
FMBWPNtoT6=10000.*fm6;
FMBWPNtoT7=10000.*fm7;
FMBWPNtoT8=10000.*fm8;

% Want NTIMIT tst feature vectors to appear to be TIMIT tst fvectors for classification../GMMs
% Need to put into .ht2 format files
stringfile=abs(spkrtemp);
stringfile=stringfile(1:(length(stringfile)));
setstr(stringfile)
for index=1:length(stringfile)
if stringfile(index)==47 % s=115
if stringfile(index+1)==115 % /=47 .=46
if (length(stringfile)==82) & (stringfile(index+9)==46) % .
filename=stringfile(index+6:index+8); % sa1 obtained, what about sa3456
elseif (length(stringfile)==83) & stringfile(index+10)==46 % .
filename=stringfile(index+6:index+9); % sa1 obtained, what about sa3456
elseif (length(stringfile)==84) & stringfile(index+11)==46 % .
filename=stringfile(index+6:index+10); % sa1 obtained, what about sa3456
elseif (length(stringfile)==85) & stringfile(index+12)==46 % .
filename=stringfile(index+6:index+11); % sa1 obtained, what about sa3456
end % elseif
% if (length(stringfile)==69) & (stringfile(index+7)==46) % .
% other order did not work
end % 115
end % 47
end % for index
varname=[filename flagnum];

basic=setstr(filename)

absfile1=[39,abs(spkrtemp(1:59)),abs('/fesgfbpi1/'),abs(filename),46,104,116,50,39];
absfile2=[39,abs(spkrtemp(1:59)),abs('/fesgfbpi2/'),abs(filename),46,104,116,50,39];
absfile3=[39,abs(spkrtemp(1:59)),abs('/fesgfbpi3/'),abs(filename),46,104,116,50,39];
absfile4=[39,abs(spkrtemp(1:59)),abs('/fesgfbpi4/'),abs(filename),46,104,116,50,39];
absfile5=[39,abs(spkrtemp(1:59)),abs('/fesgfbpi5/'),abs(filename),46,104,116,50,39];
absfile6=[39,abs(spkrtemp(1:59)),abs('/fesgfbpi6/'),abs(filename),46,104,116,50,39];
absfile7=[39,abs(spkrtemp(1:59)),abs('/fesgfbpi7/'),abs(filename),46,104,116,50,39];
absfile8=[39,abs(spkrtemp(1:59)),abs('/fesgfbpi8/'),abs(filename),46,104,116,50,39];
% absfile9=[39,abs(spkrtemp(1:59)),abs('/fesgfbpi9/'),abs(filename),46,104,116,50,39];
filename1=setstr(absfile1); % A TICK (tick ') MARK IS ABS 39
filename2=setstr(absfile2);
filename3=setstr(absfile3);
filename4=setstr(absfile4);
filename5=setstr(absfile5); % A TICK (tick ') MARK IS ABS 39
filename6=setstr(absfile6);
filename7=setstr(absfile7);
filename8=setstr(absfile8);
% filename9=setstr(absfile9);

% eval(['!mkdir ',filename1(1:71)]);
% eval(['!chmod -fR g+w,g+x ',filename1(1:71)]);
eval((['!mkdir ',filename1(2:60),'/fesgfbpi1/']));
eval((['!chmod -fR g+w,g+x ',filename1(2:60),'/fesgfbpi1/']));

eval((['!mkdir ',filename2(2:60),'/fesgfbpi2/']));
eval((['!chmod -fR g+w,g+x ',filename1(2:60),'/fesgfbpi2/']));

eval((['!mkdir ',filename3(2:60),'/fesgfbpi3/']));
eval((['!chmod -fR g+w,g+x ',filename1(2:60),'/fesgfbpi3/']));

eval((['!mkdir ',filename3(2:60),'/fesgfbpi4/']));
eval((['!chmod -fR g+w,g+x ',filename1(2:60),'/fesgfbpi4/']));

eval((['!mkdir ',filename3(2:60),'/fesgfbpi5/']));
eval((['!chmod -fR g+w,g+x ',filename1(2:60),'/fesgfbpi5/']));

eval((['!mkdir ',filename1(2:60),'/fesgfbpi6/']));
eval((['!chmod -fR g+w,g+x ',filename1(2:60),'/fesgfbpi6/']));

eval((['!mkdir ',filename2(2:60),'/fesgfbpi7/']));
eval((['!chmod -fR g+w,g+x ',filename1(2:60),'/fesgfbpi7/']));

eval((['!mkdir ',filename3(2:60),'/fesgfbpi8/']));
eval((['!chmod -fR g+w,g+x ',filename1(2:60),'/fesgfbpi8/']));
filename;

for saveiter=1:8 % 5NOV
if saveiter==1
FMBWP=FMBWPNtoT1;
FMBWT=FMBWPNtoT1;
filename=filename1;
elseif saveiter==2
FMBWP=FMBWPNtoT2;
FMBWT=FMBWPNtoT2;
filename=filename2;
elseif saveiter==3
FMBWP=FMBWPNtoT3;
FMBWT=FMBWPNtoT3;
filename=filename3;
elseif saveiter==4
FMBWP=FMBWPNtoT4;
FMBWT=FMBWPNtoT4;
filename=filename4;
elseif saveiter==5
FMBWP=FMBWPNtoT5;
FMBWT=FMBWPNtoT5;
filename=filename5;
elseif saveiter==6
FMBWP=FMBWPNtoT6;
FMBWT=FMBWPNtoT6;
filename=filename6;
elseif saveiter==7
FMBWP=FMBWPNtoT7;
FMBWT=FMBWPNtoT7;
filename=filename7;
elseif saveiter==8
FMBWP=FMBWPNtoT8;
FMBWT=FMBWPNtoT8;
filename=filename8;
end % if 48 lines up

FMBWQ=reshape(FMBWT,max(size(FMBWT))*min(size(FMBWT)),1);
FMBWP=reshape(FMBWQ,max(size(FMBWP)),min(size(FMBWP)));
cd /home/hawkeye19/97d/efitzger/thesissum/SID/makefbwp/fbf02htk
eval(['w_error=whtkparm(FMBWP,',filename,');'])
flopnum=flops;

cd /home/hawkeye19/97d/efitzger/thesissum/SID/makefbwp/makegmm/gmmfpu/nnet/netwtsnov13
eval(['save pxnov13_dr1zdwts2',spkr,num2str(maxepochs),F3,num2str(hidden),' ']);

% dontoverwrite
tstcounter=tstcounter+1;
if tstcounter==3
tstcounter=1;
end % 2 lines above...reset
end % tstcounter
end % extra in here for
end
end
end

end % if isstr
end % while
flops
toc

fclose(fid1)
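
The mapping step in the listing above scales each feature by 1/10000, passes it through a two-layer network with tansig hidden units and a logsig output, and multiplies the result by 10000. The sketch below writes out that forward pass by hand to show why the scaling matters: a logsig output always lies strictly between 0 and 1, so the targets have to be brought into that range before training and the outputs rescaled afterwards. The weights are made-up placeholders (three hidden nodes rather than the 45 used in the listing), the names feat, a1, a2, and mapped are hypothetical, and the code is only assumed to mirror what the toolbox routine simuff computes for this architecture.

```matlab
% Placeholder weights for a 9-input, 3-hidden, 1-output network.
% These are NOT trained values; the listing obtains them from trainbpx.
W1 = 0.1 * ones(3, 9);   b1 = zeros(3, 1);
W2 = [0.5 -0.5 0.25];    b2 = 0;

scale = 10000;                                     % same constant as the listing
feat = [500 1500 2500 3500 120 60 90 120 150]';    % one frame in raw units (toy values)
p = feat / scale;                                  % scaled inputs, all below 1

Q = size(p, 2);                                    % number of frames (columns of p)
a1 = tanh(W1 * p + b1 * ones(1, Q));               % hidden layer: tansig(n) equals tanh(n)
a2 = 1 ./ (1 + exp(-(W2 * a1 + b2 * ones(1, Q)))); % output layer: logsig, in (0, 1)
mapped = scale * a2;                               % back to raw units, in (0, 10000)
disp(mapped)
```

With this output layer a mapped feature can never reach or exceed 10000 in raw units, so the scale constant has to be larger than any feature value the network is expected to produce.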
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